2023:Labeling Behavior (Behavior): Difference between revisions
| Line 710: | Line 710: | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
# In Grooper 2023 to test Classification, you will need to create a '''Batch Process''' and add a "Classify" '''Batch Process Step'''. | # In Grooper 2023 to test Classification, you will need to create a '''Batch Process''' and add a "Classify" '''Batch Process Step'''. | ||
# Make sure your '''''Content Model Scope''''' is set to the appropriate '''Content Model'''. | # Make sure your '''''Content Model Scope''''' is set to the appropriate '''Content Model'''. For this example, we are using the "Labelset Classification - Invoices - Model" for classification. | ||
| | | | ||
[[File:2023-Labeling Behavior-Test Classification 01.png]] | [[File:2023-Labeling Behavior-Test Classification 01.png]] | ||
| Line 730: | Line 730: | ||
#* This means a match has been found for all labels in the "Factura" '''Document Type's''' label set. | #* This means a match has been found for all labels in the "Factura" '''Document Type's''' label set. | ||
|valign=top| | |valign=top| | ||
[[File:Labeling- | [[File:2023-Labeling Behavior-Test Classification 03.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 736: | Line 736: | ||
#* The '''Batch Folder''' has been assigned the "Envoy" '''Document Type'''. | #* The '''Batch Folder''' has been assigned the "Envoy" '''Document Type'''. | ||
# It indeed should have been classified so. It is an invoice from Envoy Imaging Systems. | # It indeed should have been classified so. It is an invoice from Envoy Imaging Systems. | ||
# However, it's a mitigated success in that its similarity score is only | # However, it's a mitigated success in that its similarity score is only 84%. | ||
#* That means only | #* That means only 84% of the labels located on this document match the label set for the "Envoy" '''Document Type'''. | ||
# In this case, this is due to poor OCR data. While some labels may be present on the document, their OCR data is too garbled to match the label in the label set. | # In this case, this is due to poor OCR data. While some labels may be present on the document, their OCR data is too garbled to match the label in the label set. | ||
#* For example, the label <code>Invoice</code> was not matched because the text was OCR'd as "nvoice". | #* For example, the label <code>Invoice</code> was not matched because the text was OCR'd as "nvoice". | ||
#* But a win is a win! Part of the reason ''Labelset-Based'' can be an effective classification method is you can miss a few labels due to poor OCR and still end up classifying the document appropriately. It is the ''set as a whole'' which determines similarity. As long as the document is more similar to the correct '''Document Type''' than any of the other '''Document Types''', '''Grooper''' has made the right classification decision. | #* But a win is a win! Part of the reason ''Labelset-Based'' can be an effective classification method is you can miss a few labels due to poor OCR and still end up classifying the document appropriately. It is the ''set as a whole'' which determines similarity. As long as the document is more similar to the correct '''Document Type''' than any of the other '''Document Types''', '''Grooper''' has made the right classification decision. | ||
|valign=top| | |valign=top| | ||
[[File:Labeling- | [[File:2023-Labeling Behavior-Test Classification 04.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 754: | Line 754: | ||
#* But just because we don't have a '''Data Field''' for it doesn't mean it's not a useful label for classification. We will look at how to create custom labels for classification purposes in the next section, [[#Common Problems and Solutions|Common Problems and Solutions]]. | #* But just because we don't have a '''Data Field''' for it doesn't mean it's not a useful label for classification. We will look at how to create custom labels for classification purposes in the next section, [[#Common Problems and Solutions|Common Problems and Solutions]]. | ||
|valign=top| | |valign=top| | ||
[[File:Labeling- | [[File:2023-Labeling Behavior-Test Classification 05.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 765: | Line 765: | ||
#* Such is the case with invoices. There's lots of different invoice formats, often unique to each vendor. When you get one in a '''Batch''' you haven't seen before, you will need to add a new '''Document Type''' to account for the new variant. However, as we will see in the [[#Common Problems and Solutions|next section]], onboarding new '''Document Types''' with Label Sets is relatively quick and painless. | #* Such is the case with invoices. There's lots of different invoice formats, often unique to each vendor. When you get one in a '''Batch''' you haven't seen before, you will need to add a new '''Document Type''' to account for the new variant. However, as we will see in the [[#Common Problems and Solutions|next section]], onboarding new '''Document Types''' with Label Sets is relatively quick and painless. | ||
|valign=top| | |valign=top| | ||
[[File:Labeling- | [[File:2023-Labeling Behavior-Test Classification 06.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 773: | Line 773: | ||
# The only thing we will need to watch out for is making sure once we do add a '''Document Type''' for the invoices from Standard Products, it classifies ''more'' confidently than the "Rechnung" '''Document Type''', beating out its similarity score and receiving the "Standard" '''Document Type'''. | # The only thing we will need to watch out for is making sure once we do add a '''Document Type''' for the invoices from Standard Products, it classifies ''more'' confidently than the "Rechnung" '''Document Type''', beating out its similarity score and receiving the "Standard" '''Document Type'''. | ||
|valign=top| | |valign=top| | ||
[[File:Labeling- | [[File:2023-Labeling Behavior-Test Classification 07.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 779: | Line 779: | ||
#* The '''Batch Folder''' ''should'' have been assigned the "Envoy" '''Document Type''' but it was unclassified. | #* The '''Batch Folder''' ''should'' have been assigned the "Envoy" '''Document Type''' but it was unclassified. | ||
# The document is of poor enough quality to get near unusable OCR results. | # The document is of poor enough quality to get near unusable OCR results. | ||
# This resulted in a paltry similarity score of | # This resulted in a paltry similarity score of 49%. | ||
What can we do about this? | What can we do about this? | ||
| Line 787: | Line 787: | ||
You have to know when to leave well enough alone. Outliers like this are a good example of when to do just that. | You have to know when to leave well enough alone. Outliers like this are a good example of when to do just that. | ||
|valign=top| | |valign=top| | ||
[[File:Labeling- | [[File:2023-Labeling Behavior-Test Classification 08.png]] | ||
|} | |} | ||
Revision as of 09:54, 29 September 2023
|
WIP |
This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion. |
The Labeling Behavior is a Content Type Behavior designed to collect and utilize a document's field labels in a variety of ways. This includes functionality for classification and data extraction.
The Labeling Behavior functionality allows Grooper users to quickly onboard new Document Types for structured and semi-structured forms, utilizing labels as a thumbprint for classification and data extraction purposes. Once the Labeling Behavior is enabled, labels are identified and collected using the "Labels" tab of Document Types. These "Label Sets" can then be used for the following purposes:
- Document classification - Using the Labelset-Based Classification Method
- Field-based data extraction - Primarily using the Labeled Value Extractor Type
- Tabular data extraction - Primarily using a Data Table object's Tabular Layout Extract Method
- Sectional data extraction - Primarily using a Data Section object's Transaction Detection Extract Method
| FYI | The Labeling Behavior and its functionality discussed in this article are often referred to as "Label Set Behavior" or simply "Label Sets". |
About

Labels serve an important function on documents. They give the reader critical context to understand where data is located and what it means. How do you know the difference between the date on an invoice document indicating when the invoice was sent and the date indicating when you should pay the invoice? It's the labels. The labels are what distinguishes one type of date from another. For example, "Invoice Date" for the date the invoice was sent and "Due Date" for the date you need to pay by.
Labels can be a way of classifying documents as well. What does one individual label tell you about a document? Well, maybe not much. However, if you take them all together, they can tell you quite a bit about the kind of document you're looking at. For example, a W-4 employee withholding form is going to use different labels than an employee healthcare enrollment form. These are two very different documents collecting very different information. The labels used to collect this information are thus different as well.
Furthermore, you can even tell the difference between two very closely related documents using labels as well. For example, two different invoices from two different vendors may share some similarity in the labels they use to detail information. But there will be some differences as well. These differences can be useful identifiers to distinguish one from the other. Put all together, labels can act as a thumbprint Grooper can use to classify a document as one Document Type or another.
The Labeling Behavior is built on these concepts, collecting and utilizing labels for Document Types in a Content Model for classification and data extraction purposes.
|
As a Behavior, the Labeling Behavior is enabled on a Content Type object in Grooper.
Once the Labeling Behavior is enabled, the next big step is collecting label sets for the various Document Types in your Content Model.
Each Document Type has its own set of labels used to define information on the document. For example, the "Factura" Document Type in this Content Model uses the label "PO Number" to call out the purchase order number on this invoice document. A different Document Type, corresponding to a different invoice format, might use a different label such as "Purchase Order Number" or "PO #".
For more information on collecting label sets for the Document Types in your Content Model see the How To section of this article.
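To make the idea of per-Document Type label sets concrete, here is a minimal sketch of how they could be pictured as a data structure. It is not how Grooper stores labels internally. Only the "Factura" Document Type and the three label spellings come from this article; the other Document Type names and the `label_for` helper are hypothetical.

```python
# Hypothetical picture of label sets:
#   Document Type -> {Data Element name -> label collected for that element}
# Only "Factura" and the three label spellings come from the article.
label_sets = {
    "Factura": {"PO Number": "PO Number"},
    "Vendor B (hypothetical)": {"PO Number": "Purchase Order Number"},
    "Vendor C (hypothetical)": {"PO Number": "PO #"},
}


def label_for(document_type: str, data_element: str) -> str:
    """Return the label collected for a Data Element, or "" if none was collected."""
    return label_sets.get(document_type, {}).get(data_element, "")


if __name__ == "__main__":
    # The same Data Element is called out by a different label on each format.
    for doc_type in label_sets:
        print(doc_type, "->", label_for(doc_type, "PO Number"))
```

The point of the sketch is that the Data Element stays the same across Document Types while the label text used to find it varies.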
|
|||
|
Once label sets are collected for each Document Type, they can be used for classification and data extraction purposes. For example, labels were used in this case to:
|
|||
For more information on how to use labels for these purposes, see the How To section of this article. |
How To
The Labeling Behavior (often referred to as "Label Set Behavior" or just "Label Sets") is well suited for structured and semi-structured document sets. Label Sets are particularly useful for situations where you have multiple variations of one kind of document or another. While the information you want to extract from the document set may be the same from variation to variation, how the data is laid out and labeled may be very different from one variation of the document to another. Label Sets allow you to quickly onboard new Document Types to capture new form structures.
|
We will use invoices for the document set in the following tutorials. In a perfect world, you'd create a Content Model with a single "Invoice" Document Type whose Data Model would successfully extract all Data Elements for all invoices from all vendors every time no matter what. This is often not the case. You may find you need to add multiple Document Types to account for variations of an invoice from multiple vendors. Label Sets give you a method of quickly adding Document Types to model new variations. In our case, we will presume we need to create one Document Type for each vendor. We will start with five Document Types for invoices from five vendors.
|
Collect Label Sets
|
Collecting labels for the Document Types in your Content Model will be the first thing you want to do after enabling the Labeling Behavior. Labels for each Data Element in the Document Type's Data Model are defined using the "Labels" tab of the Content Model.
Collect Field Labels
Now that this document has been classified (assigned a Document Type from our Content Model), we can collect labels for its Document Type. This can be done in one of three ways:
- Lassoing the label in the "Document Viewer".
- Double-clicking the label in the Document Viewer.
- Typing the label in manually.
| ‼ | Going forward, this tutorial presumes you have obtained machine readable text from these documents, either OCR'd text or native text, via the Recognize activity. |
|
Generally the quickest way is by simply lassoing the label in the "Document Viewer".
|
|||
|
|||
|
If you choose, you may also manually enter a label for a Data Element by simply typing it into the text box.
|
|||
|
|||
|
Collect Table and Column Labels
|
Table and column labels can also be used for tabular data extraction by setting a Data Table object to use the Tabular Layout Extract Method. When collecting labels for this method of table extraction, keep in mind you must collect the individual column headers and may optionally collect the full row of column header labels as well. While it is optional, capturing the full row of column header labels is generally regarded as best practice, because it will generally increase the accuracy of your column label extraction. We will do both in this tutorial.
This may seem like you are duplicating your efforts but it is often critical to do both in order for the Tabular Layout Extract Method to map the table's structure and ultimately collect the table's data.
Auto Map Labels
|
As you add labels for each Document Type, you may find some documents have labels in common. For example, there are only so many ways to label an invoice number. It might be "Invoice Number", "Invoice No", "Invoice #" or even just "Invoice". Some invoices are going to use one label, others another. When collecting labels for multiple Document Types you can use the "Auto Map" feature to automatically add labels you've previously collected on another Document Type.
|
|
|
Grooper will search the document's text for labels matching those previously collected on other Document Types.
If a match is not found, the Data Element's label is left blank.
As you keep collecting labels for more and more Document Types, the Auto Map feature will pick up more and more labels, allowing you to quickly onboard new Document Types. |
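The Auto Map idea described above can be sketched roughly as follows. This is an illustration under simple assumptions, not Grooper's implementation: matching here is a plain case-insensitive substring search, the `auto_map` function and the `known_labels` structure are hypothetical, and the sample text is invented.

```python
def auto_map(document_text, known_labels):
    """For each Data Element, pick a previously collected label found in the text.

    known_labels maps a Data Element name to the labels collected for it on
    other Document Types. Elements with no match are left blank ("").
    """
    text = document_text.lower()
    mapped = {}
    for element, candidates in known_labels.items():
        # Try longer labels first so "Invoice Number" wins over plain "Invoice".
        ordered = sorted(candidates, key=len, reverse=True)
        mapped[element] = next((c for c in ordered if c.lower() in text), "")
    return mapped


if __name__ == "__main__":
    known = {
        "Invoice Number": ["Invoice Number", "Invoice No", "Invoice #", "Invoice"],
        "PO Number": ["PO Number", "PO #"],
    }
    sample = "Invoice No 12345   Date 01/02/2023"  # invented sample text
    print(auto_map(sample, known))
```

In this sketch the "Invoice Number" element picks up "Invoice No" and the "PO Number" element is left blank, mirroring the behavior described above. The more labels have been collected on other Document Types, the more candidates there are to match.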
|
|
Be aware, you may still need to validate the auto mapped values and make adjustments.
|
Collect Custom Labels
It's important to keep in mind labels are collected for corresponding Data Elements in a Data Model. You collect one label per Data Element (Data Field, Data Section, Data Table or Data Column). What if you want to collect a label that is distinct from a Data Element, one that doesn't necessarily have to do with a value collected by your Data Model? And why would you even want to?
That's what "Custom Labels" are for. Custom labels serve two primary functions:
- Providing additional labels for classification purposes.
- Providing context labels when a Data Element's label matches multiple points on a document.
|
Custom Labels may only be added to Data Model, Data Section or Data Table objects' labels. Put another way, any Data Element in the Data Model's hierarchy that can have child Data Elements can have custom labels. When used for classification purposes, custom labels are typically added to the Data Model itself.
You may add more Custom Labels to the selected Data Element by repeating the process described above.
|
Custom Labels as Context Labels
|
Some labels are more specific than others. The label "Invoice Date" is more specific than the label "Date". If you see the label "Invoice Date" you know the date you're looking at is the date the invoice was generated. The label "Date" may refer to the invoice's generation date or it could be part of another label like "Due Date". However, some invoice formats will label the invoice date as simply "Date".
This can present a challenge for data extraction. The possibilities for false-positive results tend to crop up the more generic the label used to identify a desired value. There are three separate date values identified by the word "Date" (in full or in part) on this document. |
This is the second reason Custom Labels are typically added for a Document Type, to provide extra context for generic labels, especially when they produce multiple results on a document, leading to false-positive data extraction.
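The false-positive problem with generic labels can be demonstrated with a short sketch. The sample text and the "Due Date" and "Ship Date" labels are invented for illustration, and plain substring search stands in for whatever matching Grooper actually performs.

```python
import re

# Invented sample text with three date values, each labeled with a phrase
# containing the word "Date".
text = "Date: 01/05/2023  Due Date: 02/04/2023  Ship Date: 01/07/2023"


def find_label(label, text):
    """Return the start offset of every occurrence of the label in the text."""
    return [m.start() for m in re.finditer(re.escape(label), text)]


generic_hits = find_label("Date", text)       # matches all three labels
specific_hits = find_label("Due Date", text)  # matches only one

if __name__ == "__main__":
    print(len(generic_hits), "hits for 'Date'")
    print(len(specific_hits), "hit for 'Due Date'")
```

The generic label lands on every date on the page, while the more specific one lands on a single spot. A Context Label supplies that missing specificity when the document itself only says "Date".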
There are two steps to adding and using a Custom Label for this purpose:
- Add the Custom Label.
- Marry the Custom Label with the Data Element's label.
We will refer to this type of Custom Label as a "Context Label" from here on.
|
The only "trick" to this is adding the Context Label to the appropriate level of the Data Model's hierarchy. Remember, a Custom Label may only be added to a Data Model, Data Section or Data Table object. We cannot add a Custom Label to a Data Field, such as the "Invoice Number" Data Field. To add a Context Label a Data Field can use, we must add the Custom Label to its direct parent Data Element.
Now that we've added the label, we need to marry the Custom Label with the Data Field it's giving extra context to. This is done with the Parent property of a Data Field label.
Label Sets & Classification
About Labelset-Based Classification
Label Sets can be used for classifying documents using the Labelset-Based Classification Method. For structured and semi-structured forms, labels end up being a way of identifying a document. Without the field data entered, the labels are really what define the document. You know what kind of document you're looking at based on what kind of information is presented and, in the case of Labelset-Based classification, how that data is labeled. Even when those labels are very similar from one variant to the next, they end up being a thumbprint of that variant. For example, you might use Labelset-Based classification to create Document Types for different variations of invoices from different vendors. The information presented on each variant from each vendor will be more or less the same, and some labels will be commonly used by different vendors (such as "Invoice Number"). However, if there is enough variation in the set of labels, you can easily differentiate an invoice from one vendor versus another just based on the variation in labels.
|
Take these four "documents". Each one is collecting the same information:
So we might have five Data Fields in our Data Model, one for each piece of information. We'd also collect one label for each Data Field. While the data we want from these documents is the same, there is some variation in the labels used for each different document type. We can distinguish these four documents from each other by classifying them with the Labelset-Based Classification Method. This is done by measuring the similarity between the collected label sets for each Document Type.
How is Document Type "C" different from Document Type "A"?
How is Document Type "D" different from Document Type "A"?
|
|
Using the Labelset-Based Classification Method, unclassified documents are classified by assigning the document the Document Type whose labels are most similar. The basic concept is that "similarity" is determined by how many labels are shared between the unclassified document and the label sets collected for the Document Types in your Content Model. The unclassified document is assigned the Document Type with the highest degree of similarity between matched labels and the Document Types' label sets.
|
The similarity calculation is very straightforward. Grooper searches for labels collected for every Document Type and measures the total character difference between all the labels matched on the document. If each of these five labels is collected for each Document Type's Label Set, you'd have the following character totals for the set.
How similar is Document Type "A" to Document Type "B"?
How similar is Document Type "A" to Document Type "C"?
How similar is Document Type "A" to Document Type "D"?
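The similarity idea described above can be sketched in a few lines. This is a simplified illustration, not Grooper's actual scoring. It assumes similarity is the share of a label set's characters accounted for by labels found in the document, that matching is a case-insensitive substring search, and that the highest score wins. The `similarity` and `classify` functions, the label sets, and the sample text are all hypothetical.

```python
def similarity(document_text, label_set):
    """Share of the label set's characters covered by labels found in the text."""
    text = document_text.lower()
    total = sum(len(label) for label in label_set)
    matched = sum(len(label) for label in label_set if label.lower() in text)
    return matched / total if total else 0.0


def classify(document_text, label_sets):
    """Return (Document Type, score) for the label set with the highest similarity."""
    scores = {name: similarity(document_text, labels)
              for name, labels in label_sets.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    label_sets = {
        "A": ["Invoice Number", "Date", "Total"],
        "B": ["Invoice No", "Date", "Amount Due"],
    }
    clean = "Invoice No 1001  Date 01/05/2023  Amount Due $50.00"
    garbled = "nvoice No 1001  Date 01/05/2023  Amount Due $50.00"  # OCR dropped the "I"
    print(classify(clean, label_sets))
    print(classify(garbled, label_sets))
```

In this sketch, the garbled text misses one of "B"'s labels and scores lower than the clean text, but "B" still beats "A". That is the property that matters: the set as a whole decides the winner, so a few missed labels do not necessarily change the classification.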
If we ran one of these "documents" into Grooper, we can see these results very clearly.
|
Configuring Labelset-Based Classification
Next, we will walk through the steps required to enable and configure the Labelset-Based Classification Method, using our example set of invoice documents.
The basic steps are as follows:
- Set the Content Model's Classification Method property to Labelset-Based
- Collect labels for each Document Type
- Test classification
- Reconfigure, updating existing Document Types' Label Sets and adding new Document Types as needed.
Assign the Labelset-Based Classification Method
|
Once you've figured out you want to use Label Sets to classify your documents, you need to tell your Content Model that's what you want to do! This is done by setting the Content Model's Classification Method property to Labelset-Based.
Next, we will collect labels for each Document Type in the Content Model.
|
Collect Labels
|
See the how-to above (Collect Label Sets) for a full explanation of how to collect labels for Document Types in a Content Model. The rest of this tutorial will presume you have general familiarity with collecting labels.
Test Classification
In general, regardless of the Classification Method used, one of three things is going to happen to Batch Folders in a Batch during classification.
- The folder will be assigned the correct Document Type.
- The folder will be assigned the wrong Document Type.
- The folder will be assigned no Document Type at all.
The Labelset-Based method is no different. If all folders are classified correctly, that's great. However, testing is all about ensuring this is the case and figuring out where and why problems arise when folders are classified incorrectly or not classified at all.
We will look at a couple of examples of how classification can go wrong using the Labelset-Based method, why that is the case, and what to do about it.
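When reviewing a test Batch, it can help to tally the three outcomes listed above. The sketch below is a hypothetical bookkeeping helper, not a Grooper feature. It assumes you have recorded, for each Batch Folder, the Document Type you expected and the one that was assigned (`None` if the folder was left unclassified). The sample results are invented.

```python
from collections import Counter


def outcome(expected, assigned):
    """Bucket one Batch Folder into one of the three classification outcomes."""
    if assigned is None:
        return "unclassified"
    return "correct" if assigned == expected else "wrong"


def tally(results):
    """Count outcomes over (expected, assigned) pairs."""
    return Counter(outcome(expected, assigned) for expected, assigned in results)


if __name__ == "__main__":
    # Invented review notes for a three-folder test Batch.
    results = [
        ("Envoy", "Envoy"),        # assigned the correct Document Type
        ("Standard", "Rechnung"),  # assigned the wrong Document Type
        ("Envoy", None),           # assigned no Document Type at all
    ]
    print(tally(results))
```

Separating "wrong" from "unclassified" is useful because the two usually call for different fixes: a wrong assignment points to a competing or missing Document Type, while an unclassified folder points to too few matched labels.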
| FYI |
The example Batch in the rest of this tutorial is purposefully small to illustrate a few key points. In the real world, you will want to test using a much larger batch with several examples of each Document Type. |
Now we just need to evaluate the success or failure of our classification. Let's look at a few documents in our Batch before detailing what we will do to resolve any classification errors.
What can we do about this? Sometimes you have to know when to stop. Will it be worth it to reconfigure your Content Model and Label Sets to force Grooper to classify this document in one way or another? Probably not. This is more likely than not an extreme outlier, not representative of the larger document set. It may be easier to kick this document (and other outliers) out to human review, especially if reconfiguring the Content Model is going to negatively impact results in other ways. You have to know when to leave well enough alone. Outliers like this are a good example of when to do just that. |