2023:Labeling Behavior (Behavior)

From Grooper Wiki
Revision as of 16:25, 2 October 2023 by Rpatton (talk | contribs)

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.


The Labeling Behavior is a Content Type Behavior designed to collect and utilize a document's field labels in a variety of ways. This includes functionality for classification and data extraction.

Previous Versions

Grooper 2021

The Labeling Behavior functionality allows Grooper users to quickly onboard new Document Types for structured and semi-structured forms, utilizing labels as a thumbprint for classification and data extraction purposes. Once the Labeling Behavior is enabled, labels are identified and collected using the "Labels" tab of Document Types. These "Label Sets" can then be used for the following purposes:

  • Document classification - Using the Labelset-Based Classification Method
  • Field based data extraction - Primarily using the Labeled Value Extractor Type
  • Tabular data extraction - Primarily using a Data Table object's Tabular Layout Extract Method
  • Sectional data extraction - Primarily using a Data Section object's Transaction Detection Extract Method
FYI The Labeling Behavior and its functionality discussed in this article are often referred to as "Label Set Behavior" or simply "Label Sets".


About

Labels serve an important function on documents. They give the reader critical context to understand where data is located and what it means. How do you know the difference between the date on an invoice document indicating when the invoice was sent and the date indicating when you should pay the invoice? It's the labels. The labels are what distinguish one type of date from another. For example, "Invoice Date" for the date the invoice was sent and "Due Date" for the date you need to pay by.

Labels can be a way of classifying documents as well. What does one individual label tell you about a document? Well, maybe not much. However, if you take them all together, they can tell you quite a bit about the kind of document you're looking at. For example, a W-4 employee withholding form is going to use different labels than an employee healthcare enrollment form. These are two very different documents collecting very different information. The labels used to collect this information are thus different as well.

Furthermore, you can even tell the difference between two very closely related documents using labels as well. For example, two different invoices from two different vendors may share some similarity in the labels they use to detail information. But there will be some differences as well. These differences can be useful identifiers to distinguish one from the other. Put all together, labels can act as a thumbprint Grooper can use to classify a document as one Document Type or another.

Even though these two invoices share some labels (highlighted in blue), there are others that are unique to each one (highlighted in yellow). This awareness of how one kind of invoice from one vendor uses labels differently from another can give you a method of classifying these documents using their label sets.


The Labeling Behavior is built on these concepts, collecting and utilizing labels for Document Types in a Content Model for classification and data extraction purposes.

As a Behavior, the Labeling Behavior is enabled on a Content Type object in Grooper.

While you can enable Labeling Behavior on any Content Type, in almost all cases, you will want to enable this Behavior on the Content Model.

Typically, you want to collect and use label sets for multiple Document Types in the Content Model, not just one Document Type individually. Enabling the Behavior on the Content Model will enable the Labeling Behavior for all child Document Types, allowing you to collect and utilize labels for all Document Types.

  1. Here, we have selected a Content Model in the Node Tree.
  2. To add a Behavior, select the Behaviors property and press the ellipsis button at the end.

  1. This will bring up a dialogue window to add various behaviors to the Content Model, including the Labeling Behavior.
  2. Add the Labeling Behavior using the "Add" button.
  3. Select Labeling Behavior from the listed options.

  1. Once added, you will see a Labeling Behavior item added to the Behaviors list.
  2. Selecting the Labeling Behavior in the list, you will see property configuration options in the right panel.
    • The configuration options in the property panel pertain to fuzzy matching collected labels as well as constrained and vertical wrapping capabilities to target stacked labels.
    • By default, Grooper presumes you will want to use some fuzzy matching and enable constrained and vertical wrapping. These defaults work well for most use cases. However, you can adjust these properties here as needed.
  3. Press the "OK" button to finish adding the Labeling Behavior and exit this window.

  1. Now on the Content Model tab you should see a Behavior is set.
  2. Save your changes.

Once the Labeling Behavior is enabled, the next big step is collecting label sets for the various Document Types in your Content Model.

  1. With the Labeling Behavior enabled, you will now see a "Labels" tab present for the Content Model.
    • This tab is also now present for each individual Document Type as well.
  2. Label sets are collected in this tab for each Document Type in the Content Model.

Each Document Type has its own set of labels used to define information on the document. For example, the "Factura" Document Type in this Content Model uses the label "PO Number" to call out the purchase order number on this invoice document. A different Document Type, corresponding to a different invoice format, might use a different label such as "Purchase Order Number" or "PO #".

  1. Ultimately, this is the data we want to collect using the Content Model's Data Model.
  2. We use the "Labels" tab to collect labels corresponding to the various Data Elements (Data Fields, Data Tables, and Data Sections) of the Data Model.
    • This provides a user interface to enter a label identifying the value you wish to collect for the Data Elements.
  3. For example, the label "PO Number" identifies the purchase order number for this invoice.
  4. Therefore, the label "PO Number" is collected for the "Purchase Order Number" Data Field in the Data Model.
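
One way to picture the result is as a mapping, per Document Type, from Data Element names to label text. The Python sketch below is only an illustration of that idea and is not Grooper's internal storage format. The "Vendor B" Document Type and the `label_for` helper are hypothetical names introduced for the example.

```python
from typing import Optional

# Illustrative only: a label set modeled as {Data Element name: label text}.
# This is NOT Grooper's internal format. "Vendor B" is a hypothetical Document Type.
label_sets = {
    "Factura": {"Purchase Order Number": "PO Number"},
    "Vendor B": {"Purchase Order Number": "PO #"},
}

def label_for(document_type: str, data_element: str) -> Optional[str]:
    """Return the label collected for a Data Element, or None if none was collected."""
    return label_sets.get(document_type, {}).get(data_element)

print(label_for("Factura", "Purchase Order Number"))   # PO Number
print(label_for("Vendor B", "Purchase Order Number"))  # PO #
```

The same Data Field resolves to a different label depending on the Document Type, which is the property the rest of this article relies on.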

For more information on collecting label sets for the Document Types in your Content Model see the How To section of this article.

In Grooper 2023, for labels to show up in the Labels tab of the Content Model, a Label Set aware Value Extractor (such as Labeled Value or Tabular Layout) must be set on your Data Fields and Data Tables.

Once label sets are collected for each Document Type, they can be used for classification and data extraction purposes.

For example, labels were used in this case to:

  1. Classify the documents, assigning each document the appropriate Document Type.
  2. Extract all the Data Fields seen here, collecting field based data from the document.

  1. Extract the "Line Items" Data Table, collecting the tabular data seen here.

For more information on how to use labels for these purposes, see the How To section of this article.

How To

The Labeling Behavior (often referred to as "Label Set Behavior" or just "Label Sets") is well suited for structured and semi-structured document sets. Label Sets are particularly useful for situations where you have multiple variations for one kind of document or another. While the information you want to extract from the document set may be the same from variation to variation, how the data is laid out and labeled may be very different from one variation of the document to another. Label Sets allow you to quickly onboard new Document Types to capture new form structures.

We will use invoices for the document set in the following tutorials.

In a perfect world, you'd create a Content Model with a single "Invoice" Document Type whose Data Model would successfully extract all Data Elements for all invoices from all vendors every time no matter what.

This is often not the case. You may find you need to add multiple Document Types to account for variations of an invoice from multiple vendors. Label Sets give you a method of quickly adding Document Types to model new variations. In our case, we will presume we need to create one Document Type for each vendor.

We will start with five Document Types for invoices from five vendors.

  • Factura
  • Lasku
  • Envoy
  • Rechnung
  • Arve

You may download and import the file below into your own Grooper environment (version 2021). This contains the Batch(es) with the example document(s) discussed in this tutorial and the Content Model(s) configured according to the instructions.

Collect Label Sets

Navigate to the Labels UI

Collecting labels for the Document Types in your Content Model will be the first thing you want to do after enabling the Labeling Behavior. Labels for each Data Element in the Document Type's Data Model are defined using the "Labels" tab of the Content Model.

  1. Select the desired Content Model.
  2. Navigate to the "Labels" tab.
  3. With a Batch selected in the "Test Batch" window panel, select a document folder.

  1. Right click on the folder.
  2. Click "Assign Document Type.."

  1. When the "Assign Document Type" window pops up, click the hamburger button on the right side of the Content Type property.
  2. From the drop down, find the Content Model you are using and select the correct Document Type for the document.

  1. Click "Execute" to assign the Document Type.

FYI If you haven't added a Document Type for the selected document folder yet, you can click the "+" (plus sign) button at the top of the Labels UI to both create a Document Type and assign it to the document that is currently selected.

  1. Now that the Document Type has been set on the document, we have a couple of elements showing in our "LABELS" UI. However, we are not seeing any Data Fields or the Data Table objects.
  2. The eye icon at the top of the "Labels" tab hides all fields that do not have a label set on them by default. Click the eye button to show these fields.

  1. Now all objects are showing in the "LABELS" UI.


Collect Field Labels

Now that this document has been classified (assigned a Document Type from our Content Model), we can collect labels for its Document Type. This can be done in one of three ways:

  1. Lassoing the label in the "Document Viewer".
  2. Double-clicking the label in the Document Viewer.
  3. Typing the label in manually.
Going forward, this tutorial presumes you have obtained machine readable text from these documents, either OCR'd text or native text, via the Recognize activity.

Generally the quickest way is by simply lassoing the label in the "Document Viewer".

  1. Select the Data Element whose label you wish to collect.
    • Here, we are selecting the "Invoice Number" Data Field.
  2. Click the "Rubberband Label" button.
  3. With your cursor, lasso around the text label on the document.

  1. Upon lassoing the label in the Document Viewer, the OCR'd or native text behind the selected region will be used to populate the Data Element's label.
    • At this point, the label for the "Invoice Number" Data Field is now "Invoice Number" because that's the text data we selected. Whatever text characters you lasso with your cursor will be assigned as the label.

If you choose, you may also manually enter a label for a Data Element by simply typing it into the text box.

  1. Here we've selected the "Purchase Order Number" Data Field and entered "PO Number".
  2. This will correspond to the label "PO Number" on the document itself.

  1. If you type in the label incorrectly with typos and the label is unable to match anything on the document, an error icon will appear next to the label rather than a check mark.

An error icon will appear any time the text in the label field does not match any OCR'd data on the document. This can be due to a typo in the label or an OCR error.
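
The check behind the check mark and error icon can be pictured as a fuzzy search of the document text for the label. The Python sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in. Grooper's actual fuzzy matching algorithm and thresholds are not described in this article, so the `label_found` function, the `min_ratio` value and the sample text are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def label_found(label: str, text: str, min_ratio: float = 0.85) -> bool:
    """Slide a label-sized window over the text and report whether any
    window is at least min_ratio similar to the label (case-insensitive)."""
    label, text = label.lower(), text.lower()
    width = len(label)
    if width == 0 or len(text) < width:
        return False
    best = max(
        SequenceMatcher(None, label, text[i:i + width]).ratio()
        for i in range(len(text) - width + 1)
    )
    return best >= min_ratio

# Hypothetical OCR output: the letter "O" was read as a zero.
ocr_text = "P0 Number 12345"

print(label_found("PO Number", ocr_text))       # close enough to count as a match
print(label_found("Purchase Order", ocr_text))  # nothing similar in the text
```

With some tolerance, a single misread character does not prevent a match, while a label that simply is not on the page still fails.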

  1. Continue lassoing or manually entering labels until all are collected.
  2. Next, we will focus on collecting labels from tables and table columns (the Data Table and Data Column elements in a Data Model). The process is essentially the same, but bears some extra explanation.


Collect Table and Column Labels

Table and column labels can also be used for tabular data extraction by setting a Data Table object to use the Tabular Layout Extract Method.

When collecting labels for this method of table extraction, keep in mind you must collect the individual column headers, and may optionally collect the full row of column header labels as well.

While it is optional, it is generally regarded as best practice to capture the full row of column header labels. This will generally increase the accuracy of your column label extraction. We will do both in this tutorial.

  1. We will collect the full row of column header labels for the Data Table object's label.
  2. We will collect each individual column header label for each individual Data Column object's label.

This may seem like you are duplicating your efforts but it is often critical to do both in order for the Tabular Layout Extract Method to map the table's structure and ultimately collect the table's data.

  • In particular, if you are dealing with OCR text containing character recognition errors, establishing the full header row for the table will boost the fuzzy matching capabilities of the Labeling Behavior.

  1. To collect the Data Table's label, select the Data Table object in the Labels tab.
    • Here, we've selected the Data Table named "Line Items".
  2. Lasso the entire header row for the table.
    • You may notice there are more columns on this table than we are collecting. As it is on the document, the table has six columns. But we're only collecting four, the "Quantity", "Description", "Unit Price", and "Line Total" Data Columns.
    • Generally, you should collect the whole row of column headers, even if there are extra columns whose data you are not collecting.

  1. Next, collect each child Data Column's header label.
    • Here, we've selected the "Quantity" Data Column.
  2. Lasso the individual column header for the selected Data Column.
    • Here, the stacked label, "Qty. Ord.".

  1. Continue collecting labels for the remaining Data Columns.
  2. We have four Data Columns for this Data Table. Therefore, we collect four header labels from the document.
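
The relationship between the header row label and the column labels can be sketched in text-only form: each column label should be found inside the full header row, and its position tells you the column order. This is a simplification, since Grooper works with positions on the page rather than character offsets. In the sketch below, the header row text and every column label except "Qty. Ord." are hypothetical, as is the `locate_columns` helper.

```python
# Hypothetical header row with six columns; only four are mapped to Data Columns.
header_row = "Qty. Ord. Item No. Description Unit Price Disc. Line Total"

# Data Column name -> collected column header label.
# Only "Qty. Ord." comes from the example document; the rest are assumed.
column_labels = {
    "Quantity": "Qty. Ord.",
    "Description": "Description",
    "Unit Price": "Unit Price",
    "Line Total": "Line Total",
}

def locate_columns(header: str, labels: dict[str, str]) -> dict[str, int]:
    """Return the character offset of each column label inside the header row.
    Raises ValueError if a label is not part of the header row."""
    offsets = {}
    for column, label in labels.items():
        position = header.find(label)
        if position == -1:
            raise ValueError(f"{label!r} not found in header row")
        offsets[column] = position
    return offsets

offsets = locate_columns(header_row, column_labels)
print(sorted(offsets, key=offsets.get))  # Data Columns in left-to-right order
```

The two extra columns in the header row are simply ignored, which mirrors collecting the whole header row while mapping only the four Data Columns you need.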


Auto Map Labels

As you add labels for each Document Type, you may find some documents have labels in common. For example, there are only so many ways to label an invoice number. It might be "Invoice Number", "Invoice No", "Invoice #" or even just "Invoice". Some invoices are going to use one label, others another.

When collecting labels for multiple Document Types you can use the "Auto Map" feature to automatically add labels you've previously collected on another Document Type.

  1. So far, we've only collected labels for one Document Type, the "Factura" Document Type.
  2. Now, we're collecting labels for the "Lasku" Document Type.
  3. Press the "Auto Map" button to automatically assign previously collected labels.

Grooper will search the document's text for labels matching those previously collected on other Document Types.

  1. For example, we collected the label "Remit To:" for the "Remit Address" Data Field for the "Factura" Document Type. The "Auto Map" feature found a match for this label on the document and assigned the "Lasku" Document Type's "Remit Address" Data Field the same label.

If a match is not found, the Data Element's label is left blank.

  1. For example, the label for the "Invoice Amount" Data Field for the "Factura" Document Type was "Amount due".
  2. This label was nowhere to be found on this document. The invoice amount is labeled "Total" on the "Lasku" documents. So, the label is left blank for you to collect.
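
The Auto Map behavior described above can be sketched as a lookup of previously collected labels in the new document's text. This is a minimal illustration, assuming a plain case-insensitive substring search rather than whatever matching Grooper performs internally. The `auto_map` function and the sample "Lasku" text are hypothetical.

```python
def auto_map(known_labels: dict[str, list[str]], document_text: str) -> dict[str, str]:
    """For each Data Element, reuse the first previously collected label that
    appears in the document text; leave the label blank ("") when none match."""
    text = document_text.lower()
    mapped = {}
    for element, candidates in known_labels.items():
        mapped[element] = next((c for c in candidates if c.lower() in text), "")
    return mapped

# Labels already collected for the "Factura" Document Type (from the examples above).
factura_labels = {
    "Remit Address": ["Remit To:"],
    "Invoice Amount": ["Amount due"],
}

# Hypothetical text of a "Lasku" invoice.
lasku_text = "Remit To: 123 Example St ... Total 1,250.00"

print(auto_map(factura_labels, lasku_text))
# {'Remit Address': 'Remit To:', 'Invoice Amount': ''}
```

The blank entry for "Invoice Amount" is the cue to collect the "Total" label by hand.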

As you keep collecting labels for more and more Document Types, the Auto Map feature will pick up more and more labels, allowing you to quickly onboard new Document Types.

Be aware, you may still need to validate the auto mapped values and make adjustments.

  1. For example, the label "Date" is very generic.
  2. This label does actually correspond to the invoice date on the "Lasku" Document Type in this case.
  3. However, it could label some other date on another Document Type. Even on this document, the label "Date" is returning the "Date" portion of "Ship Date" and other instances where "Date" is found in the text.
    • As a side note, there are ways to make simple labels like "Date" more specific to the data they pertain to using "Custom Labels". More on that in the next tab.
  4. You can also make minor adjustments to the mapped labels.
    • The mapped label for the "Purchase Order Number" Data Field was "PO Number" (as it was collected for the "Factura" Document Type), but it is more specifically "PO Number:" on the "Lasku" documents. We can just add the colon at the end of the label manually.


Collect Custom Labels

It's important to keep in mind labels are collected for corresponding Data Elements in a Data Model. You collect one label per Data Element (Data Field, Data Section, Data Table or Data Column). What if you want to collect a label that is distinct from a Data Element, one that doesn't necessarily have to do with a value collected by your Data Model? And why would you even want to?

That's what "Custom Labels" are for. Custom labels serve two primary functions:

  1. Providing additional labels for classification purposes.
  2. Providing context labels when a Data Element's label matches multiple points on a document.

Custom Labels may only be added to Data Model, Data Section or Data Table objects' labels. Put another way, any Data Element in the Data Model's hierarchy that can have child Data Elements can have custom labels.

When used for classification purposes, custom labels are typically added to the Data Model itself.

  1. First select the Data Element in the Data Model's hierarchy to which you wish to add the label.
    • In our case, we're selecting the Data Model itself.
  2. Click the "Add Label" icon at the top of the "Labels" tab.

  1. When the small window pops up, enter a name for your custom label and click "Add Custom".

  1. This will add a new label tab, named whatever you named it in the previous step.
    • Here, we just named it "Custom 01".
  2. Collect the custom label by either lassoing the text using the Document Viewer or manually typing in the label.
    • For example, the word "Invoice" might be a useful label for classification purposes. This label isn't used to collect anything in our Data Model, but might be helpful to identify this and other invoices from the Factura Technology Corp as "Factura" Document Types. Collecting the label "Invoice" as a Custom Label will allow us to use it as a feature of this Document Type for classification.

  1. Now check your custom label for accuracy and OCR errors.

You may add more Custom Labels to the selected Data Element by repeating the process described above.

  1. Click the "Add Label" icon again.
  2. Name your label and click "Add Custom".

Custom Labels as Context Labels

Some labels are more specific than others. The label "Invoice Date" is more specific than the label "Date". If you see the label "Invoice Date" you know the date you're looking at is the date the invoice was generated. The label "Date" may refer to the invoice's generation date or it could be part of another label like "Due Date". However, some invoice formats will label the invoice date as simply "Date".

  1. For example, the label "Date" on this "Factura" Document Type does indeed correspond to the invoice date for the "Invoice Date" Data Field.
  2. However, this label pops up as part of other labels too, such as the "Date" in "Due Date" or "Order Date".

This can present a challenge for data extraction. The possibilities for false-positive results tend to crop up the more generic the label used to identify a desired value. There are three separate date values identified by the word "Date" (in full or in part) on this document.
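
The problem is easy to reproduce with a plain text search. In the sketch below the invoice text and its dates are made up for illustration; only the three labels come from the example above.

```python
import re

# Hypothetical text from a "Factura" invoice; the dates are made up.
text = "Date 01/02/2023   Due Date 02/01/2023   Order Date 12/15/2022"

# Every place the label "Date" occurs, whether on its own or inside a longer label.
matches = [m.start() for m in re.finditer(re.escape("Date"), text)]
print(len(matches))  # 3
```

All three hits are equally valid matches for the label "Date", so something beyond the label text is needed to pick the right one.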

This is the second reason Custom Labels are typically added for a Document Type, to provide extra context for generic labels, especially when they produce multiple results on a document, leading to false-positive data extraction.

There are two steps to adding and using a Custom Label for this purpose:

  1. Add the Custom Label.
  2. Marry the Custom Label with the Data Element's label.

We will refer to this type of Custom Label as a "Context Label" from here on.

The only "trick" to this is adding the Context Label to the appropriate level of the Data Model's hierarchy.

Remember, a Custom Label may only be added to a Data Model, Data Section or Data Table object. We cannot add a Custom Label to a Data Field, such as the "Invoice Number" Data Field.

To add a Context Label a Data Field can use, we must add the Custom Label to its direct parent Data Element.

  1. In the case of the "Invoice Date" Data Field its direct parent Data Element is the Data Model itself.

  1. Click the "Add Label" icon at the top of the "Labels" tab.
  2. Type in a name for your label and click "Add Custom".

  1. We are going to lasso these two labels "Date" and "Page" together to make the custom label.

  1. We can see that the label was collected appropriately.
    • This custom label will provide the simple label "Date" some extra context.
    • Which of the three results for the label "Date" do we want to accept? The one falling within this zone.

Now that we've added the label, we need to marry the Custom Label with the Data Field it is giving extra context to. This is done with the Parent property of a Data Field label.

  1. Click on the check mark icon next to the "Invoice Date" Data Field label.

  1. Click on the hamburger icon next to the Parent property.
  2. Using the drop down list, select the Custom Label you wish to use for the Context Label.

  1. Click "SAVE" to save your changes.

  1. Notice with this Context Label added we only return a single result for the "Invoice Date" Data Field's label "Date". This is the label we want to associate with this Data Field.
  2. The other two results do not fall within the Context Label, and are no longer returned.
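
Conceptually, the Context Label acts as a zone, and only label matches that fall inside it are kept. The Python sketch below illustrates that filtering with simple rectangles. It is not Grooper's implementation; the `Box` type, the `inside` helper and all coordinates are hypothetical values chosen for the example.

```python
from typing import NamedTuple

class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float

def inside(inner: Box, outer: Box) -> bool:
    """True when the inner box lies entirely within the outer box."""
    return (outer.left <= inner.left and inner.right <= outer.right
            and outer.top <= inner.top and inner.bottom <= outer.bottom)

# Hypothetical page coordinates (inches) for the three "Date" hits.
date_hits = {
    "Date": Box(5.0, 1.0, 5.4, 1.2),
    "Due Date (Date part)": Box(1.4, 3.0, 1.8, 3.2),
    "Order Date (Date part)": Box(3.4, 3.0, 3.8, 3.2),
}

# Hypothetical zone covered by the Context Label lassoed around "Date" and "Page".
context_zone = Box(4.9, 0.9, 7.0, 1.3)

kept = [name for name, box in date_hits.items() if inside(box, context_zone)]
print(kept)  # ['Date']
```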



Label Sets & Classification

About Labelset-Based Classification

Label Sets can be used for classifying documents using the Labelset-Based Classification Method. For structured and semi-structured forms, labels end up being a way of identifying a document. Without the field data entered, the labels are really what define the document. You know what kind of document you're looking at based on what kind of information is presented and, in the case of Labelset-Based classification, how that data is labeled. Even when those labels are very similar from one variant to the next, they end up being a thumbprint of that variant. For example, you might use Labelset-Based classification to create Document Types for different variations of invoices from different vendors. The information presented on each variant from each vendor will be more or less the same, and some labels will be more commonly used by different vendors (such as "Invoice Number"). However, if there is enough variation in the set of labels, you can easily differentiate an invoice from one vendor versus another just based on the variation in labels.

Take these four "documents". Each one is collecting the same information:

  • A person's name
  • Their social security number
  • Their birthday
  • Their phone number
  • Their address

So we might have five Data Fields in our Data Model, one for each piece of information. We'd also collect one label for each Data Field as well.

While the data we want from these documents is the same, there is some variation in the labels used for each different document type. If we wanted to distinguish these four documents from each other, we could classify them using the Labelset-Based Classification Method. This is all done by measuring the similarity between a document's labels and the label sets collected for each Document Type.


How is Document Type "B" different from Document Type "A"?

  • It uses the label SSN: instead of Social Security Number:.

How is Document Type "C" different from Document Type "A"?

  • It uses the labels SSN: instead of Social Security Number: and DOB: instead of Date of Birth:.

How is Document Type "D" different from Document Type "A"?

  • It uses the labels SSN: instead of Social Security Number:, DOB: instead of Date of Birth:, and Phone #: instead of Phone Number:.

Using the Labelset-Based Classification Method, unclassified documents are classified by assigning the document the Document Type whose labels are most similar. The basic concept is that "similarity" is determined by how many labels are shared between the unclassified document and the label sets collected for the Document Types in your Content Model. The unclassified document is assigned the Document Type with the highest degree of similarity between matched labels and the Document Types' label sets.

The similarity calculation is very straightforward. Grooper searches the document for the labels collected for every Document Type and, for each Document Type, compares the number of characters in the labels it matched against the total number of characters in that Document Type's label set.

If each of these five labels is collected for each Document Type's Label Set, you'd have the following character totals for the set.

  • Document Type "A" - 63 total label characters.
  • Document Type "B" - 44 total label characters.
  • Document Type "C" - 34 total label characters.
  • Document Type "D" - 29 total label characters.
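
These totals can be reproduced by counting every character in each label, including spaces and colons. The label text in this Python sketch is reconstructed from the comparisons between Document Types "A" through "D" above.

```python
# Label sets for the four example Document Types, as described above.
label_sets = {
    "A": ["Name:", "Social Security Number:", "Date of Birth:", "Phone Number:", "Address:"],
    "B": ["Name:", "SSN:", "Date of Birth:", "Phone Number:", "Address:"],
    "C": ["Name:", "SSN:", "DOB:", "Phone Number:", "Address:"],
    "D": ["Name:", "SSN:", "DOB:", "Phone #:", "Address:"],
}

# Total label characters per Document Type (spaces and colons included).
totals = {
    doc_type: sum(len(label) for label in labels)
    for doc_type, labels in label_sets.items()
}
print(totals)  # {'A': 63, 'B': 44, 'C': 34, 'D': 29}
```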

How similar is Document Type "A" to Document Type "B"?

  • "A" uses the label Social Security Number: instead of SSN:
  • However, there is a match for the remaining four labels.
  • The remaining four labels, Name:, Date of Birth:, Phone Number: and Address:, comprise 40 characters.
  • The similarity score is the number of matched label characters divided by the total number of characters in the Document Type's label set.
    • 40 matched label characters / 44 total label characters = 0.9091
    • "A" is roughly 91% similar to "B"

How similar is Document Type "A" to Document Type "C"?

  • "A" uses the label Social Security Number: instead of SSN: and Date of Birth: instead of DOB:
  • However, there is a match for the remaining three labels.
  • The remaining three labels, Name:, Phone Number: and Address:, comprise 26 characters.
  • The similarity score is the number of matched label characters divided by the total number of characters in the Document Type's label set.
    • 26 matched label characters / 34 total label characters = 0.7647
    • "A" is roughly 76% similar to "C"

How similar is Document Type "A" to Document Type "D"?

  • Figure out what labels from "A" match "D", and do the math.
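
The arithmetic in these comparisons can be checked with a short script. This is a minimal sketch of the calculation as it is described in this article, assuming exact label matches (no fuzzy matching). The `similarity` function is a hypothetical helper, not a Grooper API.

```python
# Label sets for the four example Document Types, as described above.
label_sets = {
    "A": ["Name:", "Social Security Number:", "Date of Birth:", "Phone Number:", "Address:"],
    "B": ["Name:", "SSN:", "Date of Birth:", "Phone Number:", "Address:"],
    "C": ["Name:", "SSN:", "DOB:", "Phone Number:", "Address:"],
    "D": ["Name:", "SSN:", "DOB:", "Phone #:", "Address:"],
}

def similarity(document_labels: set[str], type_labels: list[str]) -> float:
    """Matched label characters divided by total label characters
    in the Document Type's label set."""
    matched = sum(len(label) for label in type_labels if label in document_labels)
    total = sum(len(label) for label in type_labels)
    return matched / total if total else 0.0

# A document that carries exactly the labels of Document Type "A".
document_labels = set(label_sets["A"])

scores = {
    doc_type: round(similarity(document_labels, labels), 4)
    for doc_type, labels in label_sets.items()
}
print(scores)

# The Document Type with the highest similarity wins.
best = max(scores, key=scores.get)
print(best)  # A
```

Running it confirms the 0.9091 and 0.7647 figures for "B" and "C" and also prints the score for "D".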

If we ran one of these "documents" into Grooper, we can see these results very clearly.

  1. The document shares all five labels in common with the "A" Document Type.
  2. Grooper searches for labels matching the label sets for all Document Types in the Content Model and creates a similarity score for each one.
    • You can see the math described above play out here. Matching all labels in the "A" Document Type's label set, the document is considered 100% similar. Less so for the other Document Types because while they share some labels (like Name:), some are different (like Social Security Number: versus SSN:).
  3. Upon classification, the document folder is assigned the Document Type with the highest similarity score.
    • In this case the "A" Document Type.


Configuring Labelset-Based Classification

Next, we will walk through the steps required to enable and configure the Labelset-Based Classification Method, using our example set of invoice documents.

The basic steps are as follows:

  1. Set the Content Model's Classification Method property to Labelset-Based
  2. Collect labels for each Document Type
  3. Test classification
  4. Reconfigure, updating existing Document Types' Label Sets and adding new Document Types as needed.

Assign the Labelset-Based Classification Method

Once you've figured out you want to use Label Sets to classify your documents, you need to tell your Content Model that's what you want to do! This is done by setting the Content Model's Classification Method property to Labelset-Based.

  1. Select a Content Model in the Node Tree.
    • We've selected the "Labeling Behavior - Invoices" Content Model we've been working with in this How To section of the article.
  2. Select the Classification Method property.
  3. Using the dropdown menu, select Labelset-Based from the list of options.

Next, we will collect labels for each Document Type in the Content Model.

  1. Note we've already added a Labeling Behavior to the Behaviors property.
    • It doesn't matter whether you add a Labeling Behavior and/or collect labels before selecting Labelset-Based for the Classification Method or after.
    • However, you will need to add the Labeling Behavior at some point in order to collect label sets for the Document Types and ultimately use the Labelset-Based method for document classification. Visit the tutorial above if you're unsure how to add the Labeling Behavior to the Content Model.


Collect Labels

See the above how to (Collect Label Sets) for a full explanation of how to collect labels for Document Types in a Content Model. The rest of this tutorial will presume you have general familiarity with collecting labels.

  1. Switch to the "Labels" tab.
  2. Collect labels for each Data Element in the Document Type's Data Model.
  3. Collect labels for each Document Type in the Content Model.


Test Classification

In general, regardless of the Classification Method used, one of three things is going to happen to Batch Folders in a Batch during classification.

  1. The folder will be assigned the correct Document Type.
  2. The folder will be assigned the wrong Document Type.
  3. The folder will be assigned no Document Type at all.

The Labelset-Based method is no different. If all folders are classified correctly, that's great. However, testing is all about ensuring this is the case and figuring out where and why problems arise when folders are classified wrong or not classified at all.
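
When testing a larger Batch, it can help to tally results into these three outcomes. The sketch below assumes you have recorded, for each folder, the Document Type you expected and the one that was assigned (or `None` if it stayed unclassified). The `outcome` helper and the sample results are hypothetical.

```python
from collections import Counter
from typing import Optional

def outcome(expected: str, assigned: Optional[str]) -> str:
    """Bucket one Batch Folder into one of the three outcomes listed above."""
    if assigned is None:
        return "unclassified"
    return "correct" if assigned == expected else "wrong"

# Hypothetical test results: (expected Document Type, assigned Document Type or None).
results = [
    ("Factura", "Factura"),
    ("Envoy", "Envoy"),
    ("Envoy", None),
    ("Lasku", "Rechnung"),
]

tally = Counter(outcome(expected, assigned) for expected, assigned in results)
print(dict(tally))  # {'correct': 2, 'unclassified': 1, 'wrong': 1}
```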

We will look at a couple examples of how classification can go wrong using the Labelset-Based method, why that is the case, and what to do about it.

FYI

The example Batch in the rest of this tutorial is purposefully small to illustrate a few key points. In the real world, you will want to test using a much larger batch with several examples of each Document Type.

  1. In Grooper 2023, to test classification, you will need to create a Batch Process and add a "Classify" Batch Process Step.
  2. Make sure your Content Model Scope is set to the appropriate Content Model. For this example, we are using the "Labelset Classification - Invoices - Model" for classification.

  1. Next, go to the "Classification Tester" tab.
  2. Select all of the documents in your batch that you wish to classify.
  3. Right click on the documents, select "Classification" and then click on "Classify".

Now we just need to evaluate the success or failure of our classification. Let's look at a few documents in our Batch before detailing what we will do to resolve any classification errors.

  1. This is a complete success!
    • The Batch Folder has been assigned the "Factura" Document Type.
  2. It indeed should have been classified so. It is an invoice from the Factura Technology Corp.
  3. Its similarity score against the "Factura" Document Type is 100%.
    • This means a match has been found for all labels in the "Factura" Document Type's label set.

  1. This is a mitigated success.
    • The Batch Folder has been assigned the "Envoy" Document Type.
  2. It indeed should have been classified so. It is an invoice from Envoy Imaging Systems.
  3. However, it's a mitigated success in that its similarity score is only 84%.
    • That means only 84% of the labels in the "Envoy" Document Type's label set were matched on this document.
  4. In this case, this is due to poor OCR data. While some labels may be present on the document, their OCR data is too garbled to match the label in the label set.
    • For example, the label Invoice was not matched because the text was OCR'd as "nvoice".
    • But a win is a win! Part of the reason the Labelset-Based method can be effective is that you can miss a few labels due to poor OCR and still end up classifying the document appropriately. It is the set as a whole which determines similarity. As long as the document is more similar to the correct Document Type than any of the other Document Types, Grooper has made the right classification decision.
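
The idea that one garbled label lowers the score without sinking the classification can be sketched in a few lines of Python. This is a simplified illustration, not Grooper's actual similarity calculation: it assumes similarity is simply the fraction of labels in the label set found verbatim in the OCR text, and the label set and OCR text below are made up.

```python
def label_similarity(label_set, ocr_text):
    """Return (fraction of labels found in the OCR text, list of matched labels).

    Assumption: a label counts as matched only if it appears verbatim
    (ignoring case) somewhere in the OCR text.
    """
    text = ocr_text.lower()
    matched = [label for label in label_set if label.lower() in text]
    return len(matched) / len(label_set), matched

# Hypothetical label set and OCR text. "Invoice" was OCR'd as "nvoice".
envoy_labels = ["Invoice", "Due Date", "PO Number", "Ship To", "Total"]
ocr_text = "nvoice 1001 Due Date 11/01/2023 PO Number 4471 Ship To Acme Total $120.00"

score, matched = label_similarity(envoy_labels, ocr_text)
print(f"{score:.0%} similar; matched {matched}")
```

In this sketch four of the five labels are still found, so the document would remain far more similar to this label set than to an unrelated one.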

  1. This is a mitigated failure.
    • The Batch Folder should have been assigned the "Envoy" Document Type but it was unclassified.
  2. This is due to its similarity to the "Envoy" Document Type's Label Set falling below 60%.
    • 60% is the default Minimum Similarity for this Content Model. If a Batch Folder fails to achieve a similarity score above 60%, it will remain unclassified, as is the case here.
    • But that's so close! It just fell short in terms of similarity between matched labels and the "Envoy" Document Type's Label Set.
  3. In this case, several of the labels for the Data Elements of our Data Model are smudged on the document. OCR was unable to return these portions of the document. Therefore, the labels were not matched.
  4. Remember we collect one label per Data Element. However, there's all kinds of labels on this document for data we don't necessarily care about. Do we have a Data Field for the "Salesperson ID" field on this invoice? No, it's not data we're choosing to collect.
    • But just because we don't have a Data Field for it doesn't mean it's not a useful label for classification. We will look at how to create custom labels for classification purposes in the next section, Common Problems and Solutions.
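
The effect of the Minimum Similarity threshold can be sketched as a small decision function. This is an illustrative Python sketch under stated assumptions, not Grooper's implementation: the function name and scores are hypothetical, and treating a score of exactly 60% as passing is an assumption made here.

```python
MINIMUM_SIMILARITY = 0.60  # the default Minimum Similarity cited in this tutorial

def classify(scores, minimum=MINIMUM_SIMILARITY):
    """Return the best-scoring Document Type, or None if it misses the minimum.

    scores maps Document Type names to similarity values between 0 and 1.
    Assumption: a score exactly equal to the minimum is accepted.
    """
    if not scores:
        return None
    best_type = max(scores, key=scores.get)
    return best_type if scores[best_type] >= minimum else None

# Hypothetical scores for two folders
print(classify({"Envoy": 0.84, "Factura": 0.31}))  # clears the threshold
print(classify({"Envoy": 0.59, "Factura": 0.31}))  # just misses it, stays unclassified
```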

  1. This is also a mitigated failure.
    • The Batch Folder should have been assigned a "Stuff and Things" Document Type but it was unclassified.
  2. This is a variation of an invoice from the vendor "Stuff and Things".
  3. You may notice the "Stuff and Things" Document Type does not appear at any similarity in our similarity list.
  4. That's because there isn't a "Stuff and Things" Document Type yet. We need to add one and collect labels for it.
    • This is fairly common with a Labelset-Based approach to classification (and indeed the use of Label Sets in general). It often has its most utility in situations where you have a lot of variants of one particular kind of document. The general idea is to use Label Sets to distinguish between the variants by creating one Document Type for each variant, each with their own unique Label Set.
    • Such is the case with invoices. There's lots of different invoice formats, often unique to each vendor. When you get one in a Batch you haven't seen before, you will need to add a new Document Type to account for the new variant. However, as we will see in the next section, onboarding new Document Types with Label Sets is relatively quick and painless.

  1. This is a more severe version of the failure seen in the previous example.
    • The Batch Folder should have been assigned a "Standard" Document Type but it was assigned the wrong Document Type, the "Rechnung" Document Type.
  2. However, we don't have a "Standard" Document Type yet. Just like the previous example, we will need to add one and collect labels for it.
  3. The only thing we will need to watch out for is making sure that, once we add a Document Type for the invoices from Standard Products, those invoices score a higher similarity against it than against the "Rechnung" Document Type and are therefore assigned the "Standard" Document Type.

  1. This is a complete failure.
    • The Batch Folder should have been assigned the "Envoy" Document Type but it was unclassified.
  2. The document's quality is poor enough that its OCR results are nearly unusable.
  3. This resulted in a paltry similarity score of 49%.

What can we do about this?

Sometimes you have to know when to stop. Will it be worth it to reconfigure your Content Model and Label Sets to force Grooper to classify this document in one way or another? Probably not. This is more likely than not an extreme outlier, not representative of the larger document set. It may be easier to kick this document (and other outliers) out to human review, especially if reconfiguring the Content Model is going to negatively impact results in other ways.

You have to know when to leave well enough alone. Outliers like this are a good example of when to do just that.



Common Problems and Solutions

Custom Labels to Boost Similarity

  1. In the above tutorial, we saw this document failed to classify correctly.
    • The Batch Folder should have been assigned the "Envoy" Document Type but it was unclassified.
  2. This is due to its similarity to the "Envoy" Document Type's Label Set falling below 60%.
    • 60% is the default Minimum Similarity for this Content Model. If a Batch Folder fails to achieve a similarity score above 60%, it will remain unclassified, as is the case here.
    • But that's so close! It just fell short in terms of similarity between matched labels and the "Envoy" Document Type's Label Set.
  3. In this case, several of the labels for the Data Elements of our Data Model are smudged on the document. OCR was unable to return these portions of the document. Therefore, the labels were not matched.
  4. Remember we collect one label per Data Element. However, there's all kinds of labels on this document for data we don't necessarily care about. Do we have a Data Field for the "Salesperson ID" field on this invoice? No, it's not data we're choosing to collect.

Just because we don't have a Data Field for it doesn't mean it's not a useful label for classification. Even though we don't need to extract the salesperson's identification number, the fact that label "Salesperson ID" is present on these invoices could be important. It's another feature that makes up the "Envoy" Document Type. We just need a way of telling Grooper to use this label for classification, even though we can ignore it when it comes time to extract data from these documents.

That is one of the reasons for adding custom labels to a Document Type's Label Set.

  1. To add a custom label, first navigate to the "Labels" tab of the Content Model.
  2. Either:
    1. Select a document folder in the Batch selector of the desired Document Type.
    2. Or assign a Document Type by right clicking the document folder and going through the process of assigning the correct Document Type.
      • In our case we want to add a custom label to the "Envoy" Document Type. We have selected the document folder in the Batch and assigned it the "Envoy" Document Type.

  1. Make sure the document folder is selected.
  2. Select a Data Element from the Data Model to which you wish to add the custom label.
    • Most commonly, when adding a custom label for classification purposes, you'll just want to add it to the Data Model root itself, as we've selected here.
  3. Click the "Add Label" icon.
  4. Give your custom label a name and click "Add Custom".

  1. Adding the custom label will add a new label tab named whatever you named it.
    • In this case "Salesperson ID".
  2. Using the text editor, collect the label (either typing it in or lassoing or double clicking it in the document viewer).
    • In this case, Salesperson ID

Now that this label is in the Label Set, it will be considered a label during classification. The label's there. It's part of the document, whether we're extracting the value or not. We "tell" Grooper labels like these should be considered features for classification by creating custom labels.

FYI You can add as many custom labels as you want.

Indeed, you may want multiple custom labels, adding more label features that distinguish one Document Type from another. To add multiple custom labels, just repeat the process described above, right-clicking the label tabs and adding a new custom label for each label you want to collect.

When we re-classify this Batch, we will see some different results.

  1. Navigate to the "Classify" Batch Process Step in your node tree and re-classify the documents.
  2. Click on the "Classification Tester" tab.
  3. Notice this document now classifies correctly as an "Envoy" Document Type!
  4. Before we added the custom label, this only achieved a similarity score of 59%, falling short of the 60% minimum similarity threshold. Now, it scores a 63% similarity.
    • With another label added to the Label Set, there's more context to what comprises this Document Type.
    • And that's with just one custom label added. There are tons more labels we could collect as custom labels on the document, likely further increasing the similarity score.
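
The arithmetic behind this kind of boost can be sketched as follows. The counts are hypothetical and are not the real label counts for this document. The sketch also assumes similarity is a plain matched-over-total ratio, which is a simplification, so the numbers will not reproduce the 59% and 63% scores above exactly.

```python
MINIMUM_SIMILARITY = 0.60  # threshold used in this tutorial

def similarity(matched, total):
    """Fraction of labels in the label set that were matched on the document."""
    return matched / total

# Hypothetical counts: 13 of 22 collected labels were readable on the document.
before = similarity(13, 22)

# Add one custom label (for example "Salesperson ID") that is readable on the
# document: both the matched count and the label set size grow by one.
after = similarity(14, 23)

print(f"before: {before:.1%}, after: {after:.1%}")
print("classified" if after >= MINIMUM_SIMILARITY else "unclassified")
```

Because the added label is one that OCR can actually read, it raises the matched count as well as the total, which nudges the ratio upward.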


Adding New Document Types

The Labelset-Based classification method makes some assumptions about your document processing approach. It shines with structured and semi-structured forms. Labels, more or less, "stay put" on these kinds of documents. You'll see the same field labels over and over again even though the field values will change from document to document. This presumes your Document Types will be very regular (or rigid, with one Label Set very specifically corresponding to one Document Type). If you encounter a new form or variant of an existing form, you likely will need to account for it with a new Document Type.

  1. Such is the case for this document we encountered in the previous tutorial.
  2. The document is unclassified because it doesn't match any of the Label Sets for the existing Document Types.
    • More specifically, its similarity score to the existing Document Types does not meet the 60% minimum similarity threshold for this Content Model.
  3. This should be a "Stuff and Things" Document Type, but we don't have one yet. We need to add it and collect its Label Set to correctly classify the document.

Luckily, the process of adding new Document Types and defining their label sets is quick and painless and actually can become easier the more Document Types you add to the Content Model.

You can do the whole thing in the "Labels" tab of the Content Model.

  1. Navigate to the "Labels" tab in the Content Model.
  2. Select the unclassified document folder for which you want to create a new Document Type.
  3. Click the "+" (plus sign) button at the top.

  1. When the "Add Document Type" window pops up, name the Document Type whatever you like.
    • In our case we named it "Stuff and Things", for the invoice we want the Document Type to apply to.
  2. Click the "OK" button to finish and add the Document Type.

  1. This will add the Document Type to the Content Model.
  2. It will also assign the Document Type to the selected document folder in the test Batch.
  3. Collect labels for the document as discussed in the Collect Labels section of this article.

That's it! You've added a new Document Type and collected its Label Set.

  • Keep in mind, as you add new Document Types to the Content Model you will want to perform regression testing to ensure your classification model is still accurate.
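
One way to approach regression testing is to record classification results before and after a change and compare them. The Python sketch below is a generic illustration that does not call Grooper. The folder names, the result dictionaries, and the function names are all hypothetical, and it assumes you keep a hand-verified list of the Document Type each test folder should receive.

```python
def accuracy(expected, assigned):
    """Fraction of folders whose assigned Document Type matches the expected one."""
    correct = sum(1 for folder, doc_type in expected.items()
                  if assigned.get(folder) == doc_type)
    return correct / len(expected)

def find_regressions(expected, before, after):
    """Folders that were classified correctly before the change but not after."""
    return sorted(
        folder for folder, doc_type in expected.items()
        if before.get(folder) == doc_type and after.get(folder) != doc_type
    )

# Hypothetical hand-verified answers and two recorded classification runs
expected = {"doc1": "Factura", "doc2": "Envoy",
            "doc3": "Stuff and Things", "doc4": "Rechnung"}
before = {"doc1": "Factura", "doc2": "Envoy",
          "doc3": None, "doc4": "Rechnung"}
after = {"doc1": "Factura", "doc2": "Envoy",
         "doc3": "Stuff and Things", "doc4": "Stuff and Things"}

regressions = find_regressions(expected, before, after)
print("accuracy before:", accuracy(expected, before))
print("accuracy after:", accuracy(expected, after))
print("regressions:", regressions)
```

In this made-up run, overall accuracy is unchanged, yet one previously correct folder is now misclassified. That is exactly the kind of side effect regression testing is meant to surface.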


Document Types Sharing Similar Labels

As you keep adding more and more Document Types to the Content Model, you will inevitably keep adding more and more labels for the Data Elements in your Data Model. Eventually, you will come across a new document variant that shares a lot of similarity with an already existing Document Type.

  1. Such was the case with these three documents. They were confidently classified as "Rechnung" Document Types.
  2. Their similarity is 99%.
  3. However, these aren't invoices from the vendor Rechnung, they are from the vendor Standard Products.
    • They simply share a lot of the same labels. Interestingly, this "problem" is actually going to end up making our job even easier when adding the new Document Type.

This is where the label auto-map functionality comes in handy.

  1. Add the new Document Type.
  2. Assign the right document folder (whose labels you want to collect) the new Document Type.
  3. Click the "Auto-Map" button.
  4. In this case, Grooper found all of the labels present on another document except for the "Line Items" Data Model header label. This is the only different label between the Standard Products and Rechnung invoices.

  1. Click the "Rubberband Label" button at the top.
  2. Making sure your cursor is on the "Line Items" label, go ahead and draw a box around the header label for this table on the document.
  3. The header label should be collected for the Data Table object.

  1. If we go back and reclassify the documents now, we encounter an issue. The three Standard Products invoices aren't being classified at all!
  2. This is because each of these invoices scores 99% against both the "Rechnung" and "Standard" Document Types.

You might be wondering: if we collected the header label for the Data Table on the Standard Products invoice, and that label is definitely different from the one on the Rechnung invoice, why are both Document Types coming up with the same similarity score?

  1. Go to the Data Table object in the node tree.
  2. Make sure you're on the "Data Table" tab.
  3. Expand out the Row Count Range property.
  4. Change the Minimum to 1.

With the Minimum of the Row Count Range set to "(none)" or zero, we were essentially telling Grooper that there may or may not be a table on the document (a table with zero rows is no table at all). Since that tells Grooper the table is not a reliable indicator of the Document Type, it will not use the table for classification purposes.

If we set the Minimum to 1, we are telling Grooper to expect a table on every document. Now that the table is a reliable indicator of the Document Type, Grooper will use the table (and table headers) in classification.

If you have multiple Data Table objects in your Data Model, you will need to repeat these steps for each one.

  1. If we go reclassify the documents again, we can see that the documents are now being classified appropriately as the "Standard" Document Type.
  2. The similarities are no longer the same and have a wide enough range that there is no confusion on how this document should be classified.
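
The behavior described in this section can be modeled with a short Python sketch. It is a simplified illustration of what this tutorial describes, not Grooper's implementation. The assumptions are that a tie between the top two Document Types leaves the folder unclassified, and that a Data Table's header label only counts toward similarity when the Row Count Range minimum is at least 1. The label names, including the "Item Details" header used for Rechnung, are hypothetical.

```python
def classification_labels(field_labels, table_header_label, minimum_rows):
    """Labels used for classification; the table header counts only if rows are required."""
    labels = list(field_labels)
    if minimum_rows >= 1:
        labels.append(table_header_label)
    return labels

def similarity(labels, labels_on_document):
    """Fraction of the given labels that appear on the document."""
    return sum(1 for label in labels if label in labels_on_document) / len(labels)

def pick(scores):
    """Return the top-scoring Document Type, or None when the top two scores tie."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def classify_document(labels_on_document, minimum_rows):
    shared = ["Invoice Number", "Invoice Date", "Total"]  # labels both vendors share
    headers = {"Rechnung": "Item Details", "Standard": "Line Items"}  # hypothetical
    scores = {
        doc_type: similarity(
            classification_labels(shared, header, minimum_rows), labels_on_document)
        for doc_type, header in headers.items()
    }
    return pick(scores)

# A hypothetical Standard Products invoice
document = {"Invoice Number", "Invoice Date", "Total", "Line Items"}
tied = classify_document(document, minimum_rows=0)      # headers ignored, scores tie
resolved = classify_document(document, minimum_rows=1)  # headers break the tie
print(tied, resolved)
```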


Volatile Labels

Sometimes, you will collect a label you do not want to use for classification purposes. Most often, this is because the label may or may not be present depending on the document.

For example, some of these invoices from Standard Products have the sales tax totaled on the document. However, some do not.

This is called a "Volatile" label. Its presence on a document is unpredictable. Sometimes it's there. Sometimes it's not. It's an optional piece of information. However, because it's optional (or "volatile") we don't actually want to include this as a label for classification. It's going to decrease the similarity score for documents that do not contain the label.

  1. For example, the selected document here does not have the tax listed on the document.
  2. Since that label is not present, its similarity is lower than if it were present.
    • It drops from 100% to 98% in this case. Now, this may not be a critical drop in similarity for this case, but very well could be for others depending on their OCR quality or presence of multiple volatile labels.

You can indicate these kinds of labels are "volatile" and should not be considered for classification. Whether it's there or not, Grooper will not include it as a feature to measure the similarity between an unclassified document and the Document Type.

  1. To do this, navigate to the "Labels" tab of the Content Model.
  2. Click on the icon to the left of the collected label on the Data Element whose label you wish to turn volatile.
    • In our case, we wish to make the "Tax" Data Field's label volatile. As we've seen, sometimes it's present on the document and sometimes it's not.

  1. When the "Label Properties" window pops up, change the Volatile property from False to True.
  2. Click "SAVE" to save your changes.

  1. Now, when we classify this document folder...
  2. ...even though the sales tax label is not present on the document...
  3. ...its similarity is 100%!
    • With the "Tax" label set as volatile, it is no longer considered during the similarity calculation. When it is missing from the document, it no longer negatively impacts the similarity score.
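
The effect of marking a label volatile can be sketched as removing it from the similarity calculation entirely. The Python below is a simplified illustration, not Grooper's actual calculation, and the label set is hypothetical. Because the example set is so small, the drop is much larger than the 100% to 98% change seen above.

```python
def similarity(label_set, labels_on_document, volatile=frozenset()):
    """Fraction of non-volatile labels found on the document.

    Assumption: volatile labels are skipped whether or not they are present.
    """
    considered = [label for label in label_set if label not in volatile]
    found = sum(1 for label in considered if label in labels_on_document)
    return found / len(considered)

# Hypothetical label set; this particular invoice has no "Tax" label on it.
standard_labels = ["Invoice Number", "Invoice Date", "Subtotal", "Tax", "Total"]
document = {"Invoice Number", "Invoice Date", "Subtotal", "Total"}

without_volatile = similarity(standard_labels, document)
with_volatile = similarity(standard_labels, document, volatile={"Tax"})
print(f"Tax counted: {without_volatile:.0%}, Tax volatile: {with_volatile:.0%}")
```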
