2021:Labeling Behavior (Behavior): Difference between revisions
Dgreenwood (talk | contribs) No edit summary |
Dgreenwood (talk | contribs) No edit summary |
||
| Line 5: | Line 5: | ||
</blockquote> | </blockquote> | ||
The ''Labeling Behavior'' functionality allows Grooper users to quickly onboard new '''Document Types''' for structured and semi-structured forms, utilizing labels as a thumbprint for classification and data extraction purposes. Once the ''Labeling Behavior'' is enabled, labels are identified and collected using the "Labels" tab of '''Document Types'''. These "Label Sets" can then be used for the following purposes: | The '''''Labeling Behavior''''' functionality allows Grooper users to quickly onboard new '''Document Types''' for structured and semi-structured forms, utilizing labels as a thumbprint for classification and data extraction purposes. Once the '''''Labeling Behavior''''' is enabled, labels are identified and collected using the "Labels" tab of '''Document Types'''. These "Label Sets" can then be used for the following purposes: | ||
* Document classification - Using the '''''Labelset-Based''''' Classification Method | * Document classification - Using the '''''Labelset-Based''''' Classification Method | ||
| Line 47: | Line 47: | ||
The ''Labeling Behavior'' is built on these concepts, collecting and utilizing labels for '''Document Types''' in a '''Content Model''' for classification and data extraction purposes. | The '''''Labeling Behavior''''' is built on these concepts, collecting and utilizing labels for '''Document Types''' in a '''Content Model''' for classification and data extraction purposes. | ||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
As a '''''Behavior''''', the ''Labeling Behavior'' is enabled on a '''Content Type''' object in '''Grooper'''. | As a '''''Behavior''''', the '''''Labeling Behavior''''' is enabled on a '''Content Type''' object in '''Grooper'''. | ||
{|cellpadding="10" cellspacing="5" | {|cellpadding="10" cellspacing="5" | ||
|-style="background-color:#f89420; color:white" | |-style="background-color:#f89420; color:white" | ||
|style="font-size:22pt"|'''⚠'''||While you ''can'' enable ''Labeling Behavior'' on any '''Content Type''', in almost all cases, you will want to enable this '''''Behavior''''' on the '''Content Model'''. | |style="font-size:22pt"|'''⚠'''||While you ''can'' enable '''''Labeling Behavior''''' on any '''Content Type''', in almost all cases, you will want to enable this '''''Behavior''''' on the '''Content Model'''. | ||
Typically, you want to collect and use label sets for multiple '''Document Types''' in the '''Content Model''', not just one '''Document Type''' individually. Enabling the '''''Behavior''''' on the '''Content Model''' will enable the ''Labeling Behavior'' for all child '''Document Types''', allowing you to collect and utilize labels for all '''Document Types'''. | Typically, you want to collect and use label sets for multiple '''Document Types''' in the '''Content Model''', not just one '''Document Type''' individually. Enabling the '''''Behavior''''' on the '''Content Model''' will enable the '''''Labeling Behavior''''' for all child '''Document Types''', allowing you to collect and utilize labels for all '''Document Types'''. | ||
|} | |} | ||
# Here, we have selected a '''Content Model''' in the Node Tree. | # Here, we have selected a '''Content Model''' in the Node Tree. | ||
# To add a '''''Behavior''''', select the '''''Behaviors''''' property and press the ellipsis button at the end. | # To add a '''''Behavior''''', select the '''''Behaviors''''' property and press the ellipsis button at the end. | ||
# This will bring up a dialogue window to add various behaviors to the '''Content Model''', including the ''Labeling Behavior''. | # This will bring up a dialogue window to add various behaviors to the '''Content Model''', including the '''''Labeling Behavior'''''. | ||
# Add the ''Labeling Behavior'' using the "Add" button. | # Add the '''''Labeling Behavior''''' using the "Add" button. | ||
# Select ''Labeling Behavior'' from the listed options. | # Select ''Labeling Behavior'' from the listed options. | ||
| | | | ||
| Line 69: | Line 69: | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
# Once added, you will see a ''Labeling Behavior'' item added to the '''''Behaviors''''' list. | # Once added, you will see a '''''Labeling Behavior''''' item added to the '''''Behaviors''''' list. | ||
# Selecting the ''Labeling Behavior'' in the list, you will see property configuration options in the right panel. | # Selecting the '''''Labeling Behavior''''' in the list, you will see property configuration options in the right panel. | ||
#* The configuration options in the property panel pertain to [[Fuzzy RegEx|fuzzy matching]] collected labels as well as [[Constrained Wrap|constrained]] and [[Vertical Wrap|vertical wrapping]] capabilities to target stacked labels. | #* The configuration options in the property panel pertain to [[Fuzzy RegEx|fuzzy matching]] collected labels as well as [[Constrained Wrap|constrained]] and [[Vertical Wrap|vertical wrapping]] capabilities to target stacked labels. | ||
#* By default, '''Grooper''' presumes you will want to use some fuzzy matching and enable constrained and vertical wrapping. These defaults work well for most use cases. However, you can adjust these properties here as needed. | #* By default, '''Grooper''' presumes you will want to use some fuzzy matching and enable constrained and vertical wrapping. These defaults work well for most use cases. However, you can adjust these properties here as needed. | ||
# Press the "OK" button to finish adding the ''Labeling Behavior'' and exit this window. | # Press the "OK" button to finish adding the '''''Labeling Behavior''''' and exit this window. | ||
| | | | ||
[[File:Labeling-behavior-about-05.png]] | [[File:Labeling-behavior-about-05.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
Once the ''Labeling Behavior'' is enabled, the next big step is collecting label sets for the various '''Document Types''' in your '''Content Model'''. | Once the '''''Labeling Behavior''''' is enabled, the next big step is collecting label sets for the various '''Document Types''' in your '''Content Model'''. | ||
# With the ''Labeling Behavior'' enabled, you will now see a "Labels" tab present for the '''Content Model'''. | # With the '''''Labeling Behavior''''' enabled, you will now see a "Labels" tab present for the '''Content Model'''. | ||
#* This tab is also now present for each individual '''Document Type''' as well. | #* This tab is also now present for each individual '''Document Type''' as well. | ||
# Label sets are collected in this tab for each '''Document Type''' in the '''Content Model'''. | # Label sets are collected in this tab for each '''Document Type''' in the '''Content Model'''. | ||
| Line 112: | Line 112: | ||
== How To == | == How To == | ||
The ''Labeling Behavior'' (often referred to as "Label Set Behavior" or just "Label Sets") is well suited for structured and semi-structured document sets. Label Sets are particularly useful for situations where you have multiple variations for one kind of document or another. While the information you want to extract from the document set may be the same from variation to variation, how the data is laid out and labeled may be very different from one variation of the document to another. Label Sets allow you to quickly onboard new '''Document Types''' to capture new form structures. | The '''''Labeling Behavior''''' (often referred to as "Label Set Behavior" or just "Label Sets") is well suited for structured and semi-structured document sets. Label Sets are particularly useful for situations where you have multiple variations for one kind of document or another. While the information you want to extract from the document set may be the same from variation to variation, how the data is laid out and labeled may be very different from one variation of the document to another. Label Sets allow you to quickly onboard new '''Document Types''' to capture new form structures. | ||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
| Line 148: | Line 148: | ||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
Collecting labels for the '''Document Types''' in your '''Content Model''' will be the first thing you want to do after enabling the ''Labeling Behavior''. Labels for each '''Data Element''' in the '''Document Type's''' '''Data Model''' are defined using the "Labels" tab of the '''Content Model'''. | Collecting labels for the '''Document Types''' in your '''Content Model''' will be the first thing you want to do after enabling the '''''Labeling Behavior'''''. Labels for each '''Data Element''' in the '''Document Type's''' '''Data Model''' are defined using the "Labels" tab of the '''Content Model'''. | ||
# Navigate to the "Labels" tab of the '''Content Model'''. | # Navigate to the "Labels" tab of the '''Content Model'''. | ||
| Line 255: | Line 255: | ||
This may seem like you are duplicating your efforts but it is often critical to do both in order for the ''Tabular Layout'' '''''Extract Method''''' to map the table's structure and ultimately collect the table's data. | This may seem like you are duplicating your efforts but it is often critical to do both in order for the ''Tabular Layout'' '''''Extract Method''''' to map the table's structure and ultimately collect the table's data. | ||
* In particular, if you are dealing with OCR text containing character recognition errors, establishing the full header row for the table will boost the [[Fuzzy RegEx|fuzzy matching]] capabilities of the ''Labeling Behavior''. | * In particular, if you are dealing with OCR text containing character recognition errors, establishing the full header row for the table will boost the [[Fuzzy RegEx|fuzzy matching]] capabilities of the '''''Labeling Behavior'''''. | ||
| | | | ||
[[File:Labeling-behavior-about-15.png]] | [[File:Labeling-behavior-about-15.png]] | ||
| Line 565: | Line 565: | ||
Next, we will collect labels for each '''Document Type''' in the '''Content Model'''. | Next, we will collect labels for each '''Document Type''' in the '''Content Model'''. | ||
#<li value=4> Note we've already added a ''Labeling Behavior'' to the '''''Behaviors''''' property. | #<li value=4> Note we've already added a '''''Labeling Behavior''''' to the '''''Behaviors''''' property. | ||
#* It doesn't matter whether you add a ''Labeling Behavior'' and/or collect labels before selecting ''Labelset-Based'' for the '''''Classification Method''''' or after. | #* It doesn't matter whether you add a '''''Labeling Behavior''''' and/or collect labels before selecting ''Labelset-Based'' for the '''''Classification Method''''' or after. | ||
#* However, you will need to add the ''Labeling Behavior'' at some point in order to collect label sets for the '''Document Types''' and ultimately use the ''Labelset-Based'' method for document classification. Visit the [[#Collect Label Sets|tutorial above]] if you're unsure how to add the ''Labeling Behavior'' to the '''Content Model'''. | #* However, you will need to add the '''''Labeling Behavior''''' at some point in order to collect label sets for the '''Document Types''' and ultimately use the '''''Labelset-Based''''' method for document classification. Visit the [[#Collect Label Sets|tutorial above]] if you're unsure how to add the '''''Labeling Behavior''''' to the '''Content Model'''. | ||
| | | | ||
[[File:Labeling-behavior-classification-how to-02.png]] | [[File:Labeling-behavior-classification-how to-02.png]] | ||
| Line 946: | Line 946: | ||
=== Intro to The Labeled Value Extractor === | === Intro to The Labeled Value Extractor === | ||
For most static field based extraction, the ''Labeling Behavior'' leverages the ''Labeled Value | For most static field based extraction, the '''''Labeling Behavior''''' leverages the '''''Labeled Value''''' Extractor Type. Let's first briefly examine how '''''Labeled Value''''' works outside of the '''''Labeling Behavior''''' functionality. | ||
As the name implies, ''Labeled Value'' extractor is designed to return labeled values. A common feature of structured forms is to divide information across a series of fields. But it's not as if you just have a bunch of data randomly strewn throughout the document. Typically, the field's value will be identified by some kind of label. These labels provide the critical context to what the data refers to. | As the name implies, '''''Labeled Value''''' extractor is designed to return labeled values. A common feature of structured forms is to divide information across a series of fields. But it's not as if you just have a bunch of data randomly strewn throughout the document. Typically, the field's value will be identified by some kind of label. These labels provide the critical context to what the data refers to. | ||
''Labeled Value'' relies on the spatial relationship between the label and the value. Most often labels and their corresponding values are aligned in one of two ways. | '''''Labeled Value''''' relies on the spatial relationship between the label and the value. Most often labels and their corresponding values are aligned in one of two ways. | ||
{|cellpadding=10 cellspacing=5 style="margin:20px" | {|cellpadding=10 cellspacing=5 style="margin:20px" | ||
| | | | ||
| Line 963: | Line 963: | ||
|} | |} | ||
''Labeled Value'' uses two extractors itself, one to find the label and another for the value. If the two extractors' results are aligned horizontally or vertically within a certain amount of space (according to how the ''Labeled Value'' extractor is configured), the value's result is returned. | '''''Labeled Value''''' uses two extractors itself, one to find the label and another for the value. If the two extractors' results are aligned horizontally or vertically within a certain amount of space (according to how the '''''Labeled Value''''' extractor is configured), the value's result is returned. | ||
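The label-and-value alignment check described here can be illustrated with a minimal Python sketch. This is not Grooper's implementation. The <code>Segment</code> class, the <code>find_labeled_value</code> function, and the <code>max_right</code> and <code>max_bottom</code> parameters are hypothetical names invented for illustration, and checking to the right of the label before checking below it is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """A piece of page text with its bounding box (units are arbitrary)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def _overlaps(a_start: float, a_end: float, b_start: float, b_end: float) -> bool:
    """True when two 1-D ranges share any extent."""
    return a_start < b_end and b_start < a_end


def find_labeled_value(
    label: Segment,
    candidates: List[Segment],
    max_right: float,
    max_bottom: float,
) -> Optional[Segment]:
    """Return the nearest candidate aligned with the label, or None.

    Hypothetical logic: look to the right of the label first
    (horizontal alignment), then below it (vertical alignment).
    """
    # Candidates on the same line as the label, within max_right of its right edge.
    right_hits = [
        c for c in candidates
        if _overlaps(label.top, label.bottom, c.top, c.bottom)
        and 0 <= c.left - label.right <= max_right
    ]
    if right_hits:
        return min(right_hits, key=lambda c: c.left - label.right)

    # Candidates in the same column as the label, within max_bottom of its bottom edge.
    below_hits = [
        c for c in candidates
        if _overlaps(label.left, label.right, c.left, c.right)
        and 0 <= c.top - label.bottom <= max_bottom
    ]
    if below_hits:
        return min(below_hits, key=lambda c: c.top - label.bottom)

    return None
```

For example, a label segment "Invoice Number" with a segment "IN165796" sitting on the same line just to its right would return "IN165796", provided the gap between them is no larger than <code>max_right</code>.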
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
# For example, we could configure this "Invoice Number" '''Data Field''' to utilize the ''Labeled Value'' extractor to return the invoice number on the document. | # For example, we could configure this "Invoice Number" '''Data Field''' to utilize the '''''Labeled Value''''' extractor to return the invoice number on the document. | ||
#* Keep in mind this is the "hard" way of doing things. As we will see, the ''Labeling Behavior'' will make this process easier. | #* Keep in mind this is the "hard" way of doing things. As we will see, the '''''Labeling Behavior''''' will make this process easier. | ||
# We've set the '''''Value Extractor''''' to ''Labeled Value'' | # We've set the '''''Value Extractor''''' to ''Labeled Value'' | ||
# The label is returned by the '''''Label Extractor''''' | # The label is returned by the '''''Label Extractor''''' | ||
| Line 983: | Line 983: | ||
|} | |} | ||
However, the ''Labeled Value'' extractor's set up is a little different when combining it with the ''Labeling Behavior''. The end result is a simpler configuration, utilizing collected labels for the '''''Label Extractor'''''. | However, the '''''Labeled Value''''' extractor's set up is a little different when combining it with the '''''Labeling Behavior'''''. The end result is a simpler configuration, utilizing collected labels for the '''''Label Extractor'''''. | ||
[[#Label Sets and the Labeled Value Extractor Type|Click me to return to the top]] | [[#Label Sets and the Labeled Value Extractor Type|Click me to return to the top]] | ||
| Line 992: | Line 992: | ||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
Since this '''Content Model''' utilizes the ''Labeling Behavior'', at least part of the setup described in the previous tab was unnecessary. If you've collected a label for the '''Data Field''' and that '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value'', there is no need to configure a '''''Label Extractor'''''. Instead, '''Grooper''' will pass through the collected label to the ''Labeled Value'' extractor. | Since this '''Content Model''' utilizes the '''''Labeling Behavior''''', at least part of the setup described in the previous tab was unnecessary. If you've collected a label for the '''Data Field''' and that '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value'', there is no need to configure a '''''Label Extractor'''''. Instead, '''Grooper''' will pass through the collected label to the '''''Labeled Value''''' extractor. | ||
# For example, we've already collected a label for the "Invoice Number" '''Data Field''' for the "Factura" '''Document Type'''. | # For example, we've already collected a label for the "Invoice Number" '''Data Field''' for the "Factura" '''Document Type'''. | ||
| Line 1,006: | Line 1,006: | ||
#* All that was required in this case was to collect the label and set the '''Data Field's''' '''''Value Extractor''''' property to ''Labeled Value''. Magic! | #* All that was required in this case was to collect the label and set the '''Data Field's''' '''''Value Extractor''''' property to ''Labeled Value''. Magic! | ||
#* Not magic. Label sets. | #* Not magic. Label sets. | ||
# With ''Labeling Behavior'' enabled and a label collected for the "Invoice Number" '''Data Field''', the ''Labeled Value'' extractor's '''''Label Extractor''''' looks for a match for the collected label. | # With '''''Labeling Behavior''''' enabled and a label collected for the "Invoice Number" '''Data Field''', the '''''Labeled Value''''' extractor's '''''Label Extractor''''' looks for a match for the collected label. | ||
#* In this case <code>Invoice Number</code>. | #* In this case <code>Invoice Number</code>. | ||
# Furthermore, with ''Labeling Behavior'' enabled and a collected label utilized as the '''''Label Extractor''''', the ''Labeled Value'' extractor's '''''Value Extractor''''' will still return a value even if left unconfigured. | # Furthermore, with '''''Labeling Behavior''''' enabled and a collected label utilized as the '''''Label Extractor''''', the '''''Labeled Value''''' extractor's '''''Value Extractor''''' will still return a value even if left unconfigured. | ||
#* It will look for the nearest simple segment according to the layout settings (the '''''Maximum Distance''''' and '''''Maximum Noise''''' property). | #* It will look for the nearest simple segment according to the layout settings (the '''''Maximum Distance''''' and '''''Maximum Noise''''' property). | ||
#* The result "IN165796" is indeed the nearest simple segment and the desired result. So, there is technically nothing else we need to do. However, situations are rarely this simple and straightforward. There are some other considerations we should keep in mind. | #* The result "IN165796" is indeed the nearest simple segment and the desired result. So, there is technically nothing else we need to do. However, situations are rarely this simple and straightforward. There are some other considerations we should keep in mind. | ||
| Line 1,019: | Line 1,019: | ||
|style="font-size:22pt"|'''⚠''' | |style="font-size:22pt"|'''⚠''' | ||
| | | | ||
While you ''can'' get a result without configuring the ''Labeled Value'' extractor's '''''Value Extractor''''', that doesn't mean you ''should''. | While you ''can'' get a result without configuring the '''''Labeled Value''''' extractor's '''''Value Extractor''''', that doesn't mean you ''should''. | ||
It is considered best practice to ''always'' configure the '''''Value Extractor'''''. | It is considered best practice to ''always'' configure the '''''Value Extractor'''''. | ||
| Line 1,029: | Line 1,029: | ||
=== Best Practice Considerations === | === Best Practice Considerations === | ||
While you ''can'' get a result without configuring the ''Labeled Value'' extractor's '''''Value Extractor''''', that doesn't mean you ''should''. It is considered best practice to ''always'' configure the '''''Value Extractor'''''. | While you ''can'' get a result without configuring the '''''Labeled Value''''' extractor's '''''Value Extractor''''', that doesn't mean you ''should''. It is considered best practice to ''always'' configure the '''''Value Extractor'''''. | ||
So, why is it considered best practice to do so? The short answer is to increase the accuracy of your data extraction. A simple segment could be anything. If you know the data you're trying to extract has a certain pattern to it, you should target that data according to its pattern. Dates, for example, follow a few different patterns. Maybe it's "07/20/1969" or "07-20-69" or "July 20, 1969", but you know it's a date because it has a specific syntax or pattern to it. To increase the accuracy of your extraction, you should configure the '''''Value Extractor''''' with an extractor that returns the kind of data you're attempting to return. | So, why is it considered best practice to do so? The short answer is to increase the accuracy of your data extraction. A simple segment could be anything. If you know the data you're trying to extract has a certain pattern to it, you should target that data according to its pattern. Dates, for example, follow a few different patterns. Maybe it's "07/20/1969" or "07-20-69" or "July 20, 1969", but you know it's a date because it has a specific syntax or pattern to it. To increase the accuracy of your extraction, you should configure the '''''Value Extractor''''' with an extractor that returns the kind of data you're attempting to return. | ||
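To make the idea of targeting data by its pattern concrete, here is a small Python sketch using the standard <code>re</code> module. It is only an illustration of pattern-based matching, not the pattern Grooper uses. <code>DATE_PATTERN</code> and <code>first_date</code> are hypothetical names, and the regular expression covers only the three date styles mentioned above.

```python
import re
from typing import Iterable, Optional

# Hypothetical pattern covering the three date styles named in the text:
# "07/20/1969", "07-20-69", and "July 20, 1969".
DATE_PATTERN = re.compile(
    r"""
    \b(?:
        \d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})          # numeric: 07/20/1969 or 07-20-69
      |
        (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
        [a-z]*\.?\s+\d{1,2},\s+\d{4}                   # month name: July 20, 1969
    )\b
    """,
    re.VERBOSE,
)


def first_date(segments: Iterable[str]) -> Optional[str]:
    """Return the first date-like match found in the given text segments."""
    for text in segments:
        match = DATE_PATTERN.search(text)
        if match:
            return match.group(0)
    return None
```

Given the segments "Page" and "Feb 26, 2014", <code>first_date</code> skips "Page" because it does not look like a date and returns "Feb 26, 2014". A generic "nearest segment" rule has no way to make that distinction.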
| Line 1,035: | Line 1,035: | ||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
We can see fairly quickly why leaving the ''Labeled Value'' extractor's '''''Value Extractor''''' unconfigured is not ideal. | We can see fairly quickly why leaving the '''''Labeled Value''''' extractor's '''''Value Extractor''''' unconfigured is not ideal. | ||
# All the '''Data Fields''' in this '''Data Section''' have collected labels and are using the ''Labeled Value'' extractor. | # All the '''Data Fields''' in this '''Data Section''' have collected labels and are using the '''''Labeled Value''''' extractor. | ||
#* Except the "Vendor Name" '''Data Field'''. Ignore this '''Data Field''' for the time being. | #* Except the "Vendor Name" '''Data Field'''. Ignore this '''Data Field''' for the time being. | ||
# We only get a few accurate results. | # We only get a few accurate results. | ||
#* Without its '''''Value Extractor''''' configured, the ''Labeled Value'' extractor is going to grab whatever segment it can get. While it ''can'' be what you want, it is not ''necessarily'' what you want. | #* Without its '''''Value Extractor''''' configured, the '''''Labeled Value''''' extractor is going to grab whatever segment it can get. While it ''can'' be what you want, it is not ''necessarily'' what you want. | ||
#** The '''''Value Extractor''''' will allow you to target more specifically what you want to return. | #** The '''''Value Extractor''''' will allow you to target more specifically what you want to return. | ||
#* Furthermore, while the "Sales Tax" and "Invoice Amount" results may ''look'' accurate, they too are not. There are some OCR errors. The extracted segments "0,00" and "54.594.00" should be returned as "0.00" and "54,594.00". | #* Furthermore, while the "Sales Tax" and "Invoice Amount" results may ''look'' accurate, they too are not. There are some OCR errors. The extracted segments "0,00" and "54.594.00" should be returned as "0.00" and "54,594.00". | ||
#** The '''''Value Extractor''''' will also allow you to utilize ''Fuzzy RegEx'', '''Lexicon''' lookups, output formatting, '''Data Type''' '''''Collation''''' methods and other extractor functionalities to manipulate, format, and filter results. | #** The '''''Value Extractor''''' will also allow you to utilize ''Fuzzy RegEx'', '''Lexicon''' lookups, output formatting, '''Data Type''' '''''Collation''''' methods and other extractor functionalities to manipulate, format, and filter results. | ||
# For example, the "Date" '''Data Field''' returns the segment "Page" to the right of the label <code>Date</code> where it should be returning the date below it, "Feb 26, 2014". | # For example, the "Date" '''Data Field''' returns the segment "Page" to the right of the label <code>Date</code> where it should be returning the date below it, "Feb 26, 2014". | ||
#* If we were instead to configure the ''Labeled Value'' extractor's '''''Value Extractor''''' to only return dates, we'd get the more ''specific'' result we want and not the ''generic'' segment we don't. | #* If we were instead to configure the '''''Labeled Value''''' extractor's '''''Value Extractor''''' to only return dates, we'd get the more ''specific'' result we want and not the ''generic'' segment we don't. | ||
#* FYI: ''When the '''Value Extractor''' property is left unconfigured in this manner'', the ''Labeled Value'' extractor follows a "horizontal then vertical" order of operations. If both a '''''Right''''' '''''Maximum Distance''''' and a '''''Bottom''''' '''''Maximum Distance''''' are configured, it will look for results to the right of the label (aligned horizontally) before looking for results below the label (aligned vertically). | #* FYI: ''When the '''Value Extractor''' property is left unconfigured in this manner'', the '''''Labeled Value''''' extractor follows a "horizontal then vertical" order of operations. If both a '''''Right''''' '''''Maximum Distance''''' and a '''''Bottom''''' '''''Maximum Distance''''' are configured, it will look for results to the right of the label (aligned horizontally) before looking for results below the label (aligned vertically). | ||
|valign=top| | |valign=top| | ||
[[File:Labeling-behavior-how-to-field-extraction-05.png]] | [[File:Labeling-behavior-how-to-field-extraction-05.png]] | ||
| Line 1,052: | Line 1,052: | ||
|valign=top| | |valign=top| | ||
# If we reconfigure this "Invoice Date" '''Data Field''' slightly we will get a much more accurate result. | # If we reconfigure this "Invoice Date" '''Data Field''' slightly we will get a much more accurate result. | ||
# We've kept the '''Data Field's''' '''''Value Extractor''''' set to ''Labeled Value''. | # We've kept the '''Data Field's''' '''''Value Extractor''''' set to '''''Labeled Value'''''. | ||
# The only thing we've changed is we've set the ''Labeled Value'' extractor's '''''Value Extractor''''' to a ''Reference'' extractor pointing to a '''Data Type''' returning dates. | # The only thing we've changed is we've set the '''''Labeled Value''''' extractor's '''''Value Extractor''''' to a ''Reference'' extractor pointing to a '''Data Type''' returning dates. | ||
# Upon testing extraction, we can see now the '''Data Field''' collects the value we want, the invoice's date "02/26/2014". | # Upon testing extraction, we can see now the '''Data Field''' collects the value we want, the invoice's date "02/26/2014". | ||
# By configuring the ''Labeled Value'' extractor's '''''Value Extractor''''', it's no longer looking for just simple segments next to the label. So, the word "Page" is no longer returned. Instead, it's looking for results matching the '''''Value Extractor's''''' results. | # By configuring the '''''Labeled Value''''' extractor's '''''Value Extractor''''', it's no longer looking for just simple segments next to the label. So, the word "Page" is no longer returned. Instead, it's looking for results matching the '''''Value Extractor's''''' results. | ||
#* This increases the specificity of what the ''Labeled Value'' returns. Increased specificity yields increased accuracy. | #* This increases the specificity of what the '''''Labeled Value''''' returns. Increased specificity yields increased accuracy. | ||
|valign=top| | |valign=top| | ||
[[File:Labeling-behavior-how-to-field-extraction-06.png]] | [[File:Labeling-behavior-how-to-field-extraction-06.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
Configuring the ''Labeled Value'' extractor's '''''Value Extractor''''' also gives you the myriad of functionalities available to extractors. For example, ''Fuzzy RegEx'' is one of the main ways '''Grooper''' gets around poor OCR data at the time of extraction. When the text data is just a couple characters off of the extractor's regex pattern, ''Fuzzy RegEx'' can not only match the imperfect data but "swap" the wrong characters for the right ones, effectively cleansing your result. | Configuring the '''''Labeled Value''''' extractor's '''''Value Extractor''''' also gives you the myriad of functionalities available to extractors. For example, ''Fuzzy RegEx'' is one of the main ways '''Grooper''' gets around poor OCR data at the time of extraction. When the text data is just a couple characters off of the extractor's regex pattern, ''Fuzzy RegEx'' can not only match the imperfect data but "swap" the wrong characters for the right ones, effectively cleansing your result. | ||
# Take the "Invoice Amount" '''Data Field''' for example. | # Take the "Invoice Amount" '''Data Field''' for example. | ||
# Here, the '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value''. | # Here, the '''Data Field's''' '''''Value Extractor''''' is set to '''''Labeled Value'''''. | ||
# And, the ''Labeled Value'' extractor's '''''Value Extractor''''' is left unconfigured. | # And, the '''''Labeled Value''''' extractor's '''''Value Extractor''''' is left unconfigured. | ||
# The ''Labeled Value'' extractor first locates the collected label <code>Amount Due</code> and without a configured '''''Value Extractor''''' returns the nearest text segment (according to the '''''Maximum Distance''''' settings). | # The '''''Labeled Value''''' extractor first locates the collected label <code>Amount Due</code> and without a configured '''''Value Extractor''''' returns the nearest text segment (according to the '''''Maximum Distance''''' settings). | ||
# This is ''almost'' the result we want. | # This is ''almost'' the result we want. | ||
#* It's the "right" result in that, yes, that is the text segment that corresponds to the invoice amount due for this invoice. | #* It's the "right" result in that, yes, that is the text segment that corresponds to the invoice amount due for this invoice. | ||
| Line 1,074: | Line 1,074: | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
However, that's just a single character off from being the right result. We could build an extractor to return currency values looking to make fuzzy swaps like this, both matching text that is slightly off and reformatting the result to match a valid currency format. If we used that extractor as the ''Labeled Value'' extractor's '''''Value Extractor''''' it would not only find the segment but also reformat the result, swapping the mis-OCR'd period for what it should be, a comma. | However, that's just a single character off from being the right result. We could build an extractor to return currency values looking to make fuzzy swaps like this, both matching text that is slightly off and reformatting the result to match a valid currency format. If we used that extractor as the '''''Labeled Value''''' extractor's '''''Value Extractor''''' it would not only find the segment but also reformat the result, swapping the mis-OCR'd period for what it should be, a comma. | ||
And we've done just that. | And we've done just that. | ||
# Here, we've set the ''Labeled Value'' extractor's '''''Value Extractor''''' to reference a '''Data Type''' returning fuzzy matched currency values. | # Here, we've set the '''''Labeled Value''''' extractor's '''''Value Extractor''''' to reference a '''Data Type''' returning fuzzy matched currency values. | ||
# The '''''Value Extractor''''' matches the text we want, below the label <code>Amount Due</code> | # The '''''Value Extractor''''' matches the text we want, below the label <code>Amount Due</code> | ||
# And since the referenced extractor uses ''Fuzzy RegEx'' the returned result is now a valid currency value. | # And since the referenced extractor uses ''Fuzzy RegEx'' the returned result is now a valid currency value. | ||
| Line 1,098: | Line 1,098: | ||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
Continuing from the [[#Using the Labeled Value Extractor Type with Label Sets|tutorial above's]] discussion of an unconfigured | Continuing from the [[#Using the Labeled Value Extractor Type with Label Sets|tutorial above's]] discussion of an unconfigured Labeled Value '''''Value Extractor''''', let's examine the results of the "Purchase Order Number" '''Data Field'''. | ||
# We've selected the "Purchase Order Number" '''Data Field''' in the Node Tree. | # We've selected the "Purchase Order Number" '''Data Field''' in the Node Tree. | ||
# The '''Data Field's''' '''''Value Extractor''''' property is set to ''Labeled Value''. | # The '''Data Field's''' '''''Value Extractor''''' property is set to ''Labeled Value''. | ||
# It currently does not have the ''Labeled Value'' extractor's '''''Value Extractor''''' configured. | # It currently does not have the '''''Labeled Value''''' extractor's '''''Value Extractor''''' configured. | ||
# Left unconfigured, we get an undesirable result, a rather large text segment "Order Date Customer No. Salesperson Order No. Ship Via". | # Left unconfigured, we get an undesirable result, a rather large text segment "Order Date Customer No. Salesperson Order No. Ship Via". | ||
This is obviously not what we want. We want the purchase order number listed below it. Ultimately, we will follow best practice and configure the ''Labeled Value'' extractor's '''''Value Extractor''''' property. | This is obviously not what we want. We want the purchase order number listed below it. Ultimately, we will follow best practice and configure the '''''Labeled Value''''' extractor's '''''Value Extractor''''' property. | ||
However, before we do, this gives us an opportunity to demonstrate some additional functionality of the ''Labeling Behavior''. | However, before we do, this gives us an opportunity to demonstrate some additional functionality of the '''''Labeling Behavior'''''. | ||
This data "Order Date Customer No. Salesperson Order No. Ship Via" is itself comprised of labels pointing to various values on the document. Even though we haven't set up '''Data Fields''' in this '''Data Model''' to capture the values they point to, we know this is data we don't want. In general, you don't want to use '''Grooper''' to extract labels, you want to extract values. | This data "Order Date Customer No. Salesperson Order No. Ship Via" is itself comprised of labels pointing to various values on the document. Even though we haven't set up '''Data Fields''' in this '''Data Model''' to capture the values they point to, we know this is data we don't want. In general, you don't want to use '''Grooper''' to extract labels, you want to extract values. | ||
| Line 1,114: | Line 1,114: | ||
|} | |} | ||
What's happening here is '''Grooper''' is returning ''all'' the text on this single line until a collected label in this '''Document Type's''' label set is located. In this case, the label <code>Terms</code> was collected for the "Payment Terms" '''Data Field'''. None of the text between the label <code>PO Number</code> and the label <code>Terms</code> has been collected in the label set. So, the ''Labeled Value'' extractor returns all the text between the "PO Number" '''Data Field's''' label (<code>PO Number</code>) and the next encountered label (<code>Terms</code>), resulting in "Order Date Customer No. Salesperson Order No. Ship Via". | What's happening here is '''Grooper''' is returning ''all'' the text on this single line until a collected label in this '''Document Type's''' label set is located. In this case, the label <code>Terms</code> was collected for the "Payment Terms" '''Data Field'''. None of the text between the label <code>PO Number</code> and the label <code>Terms</code> has been collected in the label set. So, the '''''Labeled Value''''' extractor returns all the text between the "PO Number" '''Data Field's''' label (<code>PO Number</code>) and the next encountered label (<code>Terms</code>), resulting in "Order Date Customer No. Salesperson Order No. Ship Via". | ||
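The behavior described above can be approximated with a short sketch. This is only an illustration in Python, not Grooper's implementation: the function name <code>text_until_next_label</code> is hypothetical, and the sketch assumes the line is available as plain text and that labels match exactly (no fuzzy matching).

```python
def text_until_next_label(line_text, field_label, collected_labels):
    """Return the text after field_label, up to the next collected label on the line.

    Hypothetical helper: assumes exact string matching on a single line of text.
    """
    start = line_text.find(field_label)
    if start == -1:
        return None  # the field's own label is not on this line

    rest_start = start + len(field_label)
    end = len(line_text)  # default: run to the end of the line

    # Stop at the earliest other collected label found after the field's label.
    for label in collected_labels:
        if label == field_label:
            continue
        pos = line_text.find(label, rest_start)
        if pos != -1:
            end = min(end, pos)

    return line_text[rest_start:end].strip()


line = "PO Number Order Date Customer No. Salesperson Order No. Ship Via Terms"
label_set = ["PO Number", "Terms"]
print(text_until_next_label(line, "PO Number", label_set))
```

With only <code>PO Number</code> and <code>Terms</code> in the label set, everything between the two labels comes back as one long segment, which mirrors the undesirable result seen in this example.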
{|cellpadding="10" cellspacing="5" | {|cellpadding="10" cellspacing="5" | ||
| Line 1,121: | Line 1,121: | ||
'''⚠''' | '''⚠''' | ||
| | | | ||
This is very specific functionality to the ''Labeled Value'' extractor and its interaction with label sets. It will only behave this way if you: | This is very specific functionality to the '''''Labeled Value''''' extractor and its interaction with label sets. It will only behave this way if you: | ||
# Are using the ''Labeling Behavior'' and the '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value''. | # Are using the '''''Labeling Behavior''''' and the '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value''. | ||
# Have collected other labels on the same line as the '''Data Field's''' label. | # Have collected other labels on the same line as the '''Data Field's''' label. | ||
# Have ''not'' configured the ''Labeled Value'' extractor's '''''Value Extractor'''''. | # Have ''not'' configured the '''''Labeled Value''''' extractor's '''''Value Extractor'''''. | ||
|} | |} | ||
| Line 1,150: | Line 1,150: | ||
'''FYI''' | '''FYI''' | ||
| | | | ||
Keep in mind this is very specific functionality to the ''Labeled Value'' extractor and its interaction with label sets. It will only behave this way if you: | Keep in mind this is very specific functionality to the '''''Labeled Value''''' extractor and its interaction with label sets. It will only behave this way if you: | ||
# Are using the ''Labeling Behavior'' and the '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value''. | # Are using the '''''Labeling Behavior''''' and the '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value''. | ||
# Have collected other labels on the same line as the '''Data Field's''' label. | # Have collected other labels on the same line as the '''Data Field's''' label. | ||
# Have ''not'' configured the ''Labeled Value'' extractor's '''''Value Extractor'''''. | # Have ''not'' configured the '''''Labeled Value''''' extractor's '''''Value Extractor'''''. | ||
|} | |} | ||
| | | | ||
| Line 1,161: | Line 1,161: | ||
If we were to go one step further and add an <code>Order Date</code> Custom Label, we wouldn't get any result returned at all! | If we were to go one step further and add an <code>Order Date</code> Custom Label, we wouldn't get any result returned at all! | ||
Because there is no text between the '''Data Field's''' label and another label in the label set, the ''Labeled Value | Because there is no text between the '''Data Field's''' label and another label in the label set, the '''''Labeled Value''''' extractor will return absolutely nothing at all. | ||
{|cellpadding="10" cellspacing="5" | {|cellpadding="10" cellspacing="5" | ||
| Line 1,170: | Line 1,170: | ||
One last time, for emphasis... | One last time, for emphasis... | ||
Keep in mind this is very specific functionality to the ''Labeled Value'' extractor and its interaction with label sets. It will only behave this way if you: | Keep in mind this is very specific functionality to the '''''Labeled Value''''' extractor and its interaction with label sets. It will only behave this way if you: | ||
# Are using the ''Labeling Behavior'' and the '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value''. | # Are using the '''''Labeling Behavior''''' and the '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value''. | ||
# Have collected other labels on the same line as the '''Data Field's''' label. | # Have collected other labels on the same line as the '''Data Field's''' label. | ||
# Have ''not'' configured the ''Labeled Value'' extractor's '''''Value Extractor'''''. | # Have ''not'' configured the '''''Labeled Value''''' extractor's '''''Value Extractor'''''. | ||
|} | |} | ||
| | | | ||
| Line 1,181: | Line 1,181: | ||
<span style="font-size:125%">'''HOWEVER, this was not the right solution for this problem.'''</span> | <span style="font-size:125%">'''HOWEVER, this was not the right solution for this problem.'''</span> | ||
This was ''only'' an educational exercise to make you aware of how labels in a label set interact with the ''Labeled Value'' | This was ''only'' an educational exercise to make you aware of how labels in a label set interact with the '''''Labeled Value''''' extractor when its '''''Value Extractor''''' is left unconfigured. | ||
We ''should'' have followed our best practice advice and configured the ''Labeled Value'' extractor's '''''Value Extractor'''''. We did not really have to go through the trouble of adding a bunch of Custom Labels. With the ''Labeled Value'' extractor's '''''Value Extractor''''' configured, it's going to ignore this whole business of finding a nearby segment or returning text on a line up to the next label in a label set and more specifically return the data you want to target. | We ''should'' have followed our best practice advice and configured the '''''Labeled Value''''' extractor's '''''Value Extractor'''''. We did not really have to go through the trouble of adding a bunch of Custom Labels. With the '''''Labeled Value''''' extractor's '''''Value Extractor''''' configured, it's going to ignore this whole business of finding a nearby segment or returning text on a line up to the next label in a label set and more specifically return the data you want to target. | ||
# Here, we have the ''Labeled Value'' extractor's '''''Value Extractor''''' configured to reference a '''Data Type''' returning various purchase order number formats. | # Here, we have the '''''Labeled Value''''' extractor's '''''Value Extractor''''' configured to reference a '''Data Type''' returning various purchase order number formats. | ||
# Even without adding all the extra Custom Labels, we get what we want. The "Purchase Order Number" '''Data Field''' collects the purchase order number on the document, "PO009845", upon testing extraction. | # Even without adding all the extra Custom Labels, we get what we want. The "Purchase Order Number" '''Data Field''' collects the purchase order number on the document, "PO009845", upon testing extraction. | ||
| Line 1,201: | Line 1,201: | ||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
The '''''Maximum Noise''''' property of the ''Labeled Value'' extractor controls the maximum number of "noise characters" allowed in the "bounding-region" of a label-value pair. | The '''''Maximum Noise''''' property of the '''''Labeled Value''''' extractor controls the maximum number of "noise characters" allowed in the "bounding-region" of a label-value pair. | ||
Now, what does that mean? Let's look at an example, using the "Remit Address" '''Data Field''' of our example '''Data Model'''. | Now, what does that mean? Let's look at an example, using the "Remit Address" '''Data Field''' of our example '''Data Model'''. | ||
| Line 1,207: | Line 1,207: | ||
# We've selected the "Remit Address" '''Data Field'''. | # We've selected the "Remit Address" '''Data Field'''. | ||
# The '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value''. | # The '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value''. | ||
# The ''Labeled Value'' extractor's '''''Label Extractor''''' is left unconfigured. | # The '''''Labeled Value''''' extractor's '''''Label Extractor''''' is left unconfigured. | ||
#* The extractor will use the collected label for this '''Data Field''' for each '''Document Type'''. | #* The extractor will use the collected label for this '''Data Field''' for each '''Document Type'''. | ||
# The ''Labeled Value'' extractor's '''''Value Extractor''''' is configured to reference a '''Data Type''' returning all addresses for this document set. | # The '''''Labeled Value''''' extractor's '''''Value Extractor''''' is configured to reference a '''Data Type''' returning all addresses for this document set. | ||
#* We've followed best practice here and assigned a '''''Value Extractor'''''. There's nothing wrong with the referenced '''Data Type''' (named "VAL - Address"). It returns the street address and city, state, zip code line for all addresses on these invoices. | #* We've followed best practice here and assigned a '''''Value Extractor'''''. There's nothing wrong with the referenced '''Data Type''' (named "VAL - Address"). It returns the street address and city, state, zip code line for all addresses on these invoices. | ||
# What we should get upon extracting the document is this: | # What we should get upon extracting the document is this: | ||
| Line 1,223: | Line 1,223: | ||
Noise characters are any letters and digits falling within the bounding region defined by a label value. For our example, the bounding region looks like this. | Noise characters are any letters and digits falling within the bounding region defined by a label value. For our example, the bounding region looks like this. | ||
# The label, highlighted in blue, is established by the ''Labeled Value'' extractor's '''''Label Extractor''''' result. | # The label, highlighted in blue, is established by the '''''Labeled Value''''' extractor's '''''Label Extractor''''' result. | ||
# The value, highlighted in green, is established by the '' Labeled Value'' extractor's '''''Value Extractor''''' result. | # The value, highlighted in green, is established by the '''''Labeled Value''''' extractor's '''''Value Extractor''''' result. | ||
# The bounding region, highlighted in yellow, is the smallest rectangle which can enclose both the label and the value. | # The bounding region, highlighted in yellow, is the smallest rectangle which can enclose both the label and the value. | ||
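The bounding region and noise count can be sketched as follows. This is a rough Python illustration under stated assumptions, not Grooper's code: the <code>Rect</code> type, the function names, and the sample coordinates are all hypothetical, and the sketch assumes a word counts as noise only when it lies entirely inside the region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float


def bounding_region(label, value):
    """Smallest rectangle that encloses both the label and the value."""
    return Rect(
        min(label.left, value.left),
        min(label.top, value.top),
        max(label.right, value.right),
        max(label.bottom, value.bottom),
    )


def contains(outer, inner):
    """True when inner lies entirely inside outer (an assumption of this sketch)."""
    return (
        outer.left <= inner.left
        and outer.top <= inner.top
        and outer.right >= inner.right
        and outer.bottom >= inner.bottom
    )


def count_noise(label, value, other_words):
    """Count letters and digits of other words that fall inside the bounding region.

    other_words is a list of (text, Rect) pairs, excluding the label and value.
    """
    region = bounding_region(label, value)
    return sum(
        sum(ch.isalnum() for ch in text)
        for text, rect in other_words
        if contains(region, rect)
    )


# Hypothetical page coordinates for illustration only.
label_rect = Rect(100, 100, 220, 120)
value_rect = Rect(100, 200, 300, 240)
words = [
    ("Invoice", Rect(120, 140, 200, 160)),  # inside the region
    ("#12-A", Rect(210, 140, 260, 160)),    # inside; only "12A" counts
    ("Total", Rect(400, 140, 450, 160)),    # outside the region
]
noise = count_noise(label_rect, value_rect, words)
print(noise)
```

The resulting count is the number that would be compared against the '''''Maximum Noise''''' property.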
|valign=top| | |valign=top| | ||
| Line 1,258: | Line 1,258: | ||
# Here, we've upped the '''''Maximum Noise''''' property to ''25''. | # Here, we've upped the '''''Maximum Noise''''' property to ''25''. | ||
# Upon extraction, the ''Labeled Value'' counts the number of noise characters in the bounding region between the label and the value. | # Upon extraction, the '''''Labeled Value''''' counts the number of noise characters in the bounding region between the label and the value. | ||
# If the number of noise characters is less than the '''''Maximum Noise''''' property's number, the result is returned. | # If the number of noise characters is less than the '''''Maximum Noise''''' property's number, the result is returned. | ||
#* 15 is less than 25. Therefore, the result is returned. | #* 15 is less than 25. Therefore, the result is returned. | ||
| Line 1,272: | Line 1,272: | ||
For '''Data Field''' objects, you can collect both a "Header Label" as well as a "Footer Label". As we've seen the Header Label is the text label for whatever field you're trying to extract. Essentially, the text label marks the ''beginning'' of the field's content. | For '''Data Field''' objects, you can collect both a "Header Label" as well as a "Footer Label". As we've seen the Header Label is the text label for whatever field you're trying to extract. Essentially, the text label marks the ''beginning'' of the field's content. | ||
The Footer Label is an optional label used to mark the ''end'' of the field's content. The Footer Label is useful when leaving the ''Labeled Value'' extractor's '''''Value Extractor''''' unconfigured. While it is still always considered best practice to configure the ''Labeled Value'' extractor's '''''Value Extractor''''', certain types of data, such as a person's name, are difficult to match with a regular expression. In situations like these, where you ''must'' run the ''Labeled Value'' extractor ''without'' a '''''Value Extractor''''', a Footer Label can often aid you in throwing out false positive or "junk" data. | The Footer Label is an optional label used to mark the ''end'' of the field's content. The Footer Label is useful when leaving the '''''Labeled Value''''' extractor's '''''Value Extractor''''' unconfigured. While it is still always considered best practice to configure the '''''Labeled Value''''' extractor's '''''Value Extractor''''', certain types of data, such as a person's name, are difficult to match with a regular expression. In situations like these, where you ''must'' run the '''''Labeled Value''''' extractor ''without'' a '''''Value Extractor''''', a Footer Label can often aid you in throwing out false positive or "junk" data. | ||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
| Line 1,280: | Line 1,280: | ||
We would create a '''Data Field''' and collect a label for <code>Settlement Agent</code>. We would then set that '''Data Field's''' '''''Value Extractor''''' to ''Labeled Value''. | We would create a '''Data Field''' and collect a label for <code>Settlement Agent</code>. We would then set that '''Data Field's''' '''''Value Extractor''''' to ''Labeled Value''. | ||
In the case of this document, we would get the result we wanted. The ''Labeled Value'' extractor's '''''Label Extractor''''' would match the collected label (stroked in blue). If left unconfigured, its '''''Value Extractor''''' would return the nearest segment to that label's location (according to its layout settings and operation discussed previously in this tutorial). This is exactly what we want, the highlighted name "Jourdain Meardon". | In the case of this document, we would get the result we wanted. The '''''Labeled Value''''' extractor's '''''Label Extractor''''' would match the collected label (stroked in blue). If left unconfigured, its '''''Value Extractor''''' would return the nearest segment to that label's location (according to its layout settings and operation discussed previously in this tutorial). This is exactly what we want, the highlighted name "Jourdain Meardon". | ||
| | | | ||
[[File:Labeling-behavior-how-to-field-extraction-footer-01.png]] | [[File:Labeling-behavior-how-to-field-extraction-footer-01.png]] | ||
| Line 1,292: | Line 1,292: | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
With a Footer Label, we can change how the ''Labeled Value'' operates when its '''''Value Extractor''''' is left unconfigured. | With a Footer Label, we can change how the '''''Labeled Value''''' operates when its '''''Value Extractor''''' is left unconfigured. | ||
If we collect <code>Seller</code> for the "Settlement Agent" '''Data Field's''' Footer Label (stroked in red), we will restrict ''Labeled Value'' to only return text between the Header and Footer Labels (highlighted in yellow). With no text falling between the header and footer, the false positives will not return. In fact, no value will return at all! | If we collect <code>Seller</code> for the "Settlement Agent" '''Data Field's''' Footer Label (stroked in red), we will restrict '''''Labeled Value''''' to only return text between the Header and Footer Labels (highlighted in yellow). With no text falling between the header and footer, the false positives will not return. In fact, no value will return at all! | ||
| | | | ||
[[File:Labeling-behavior-how-to-field-extraction-footer-03.png]] | [[File:Labeling-behavior-how-to-field-extraction-footer-03.png]] | ||
| Line 1,329: | Line 1,329: | ||
When we test extraction for the "Settlement Agent" '''Data Field''' now, we get very different results. | When we test extraction for the "Settlement Agent" '''Data Field''' now, we get very different results. | ||
# With a '''''Footer''''' Label added, and the ''Labeled Value'' extractor's '''''Value Extractor''''' unconfigured... | # With a '''''Footer''''' Label added, and the '''''Labeled Value''''' extractor's '''''Value Extractor''''' unconfigured... | ||
# ...the extractor will ''only'' return text between the '''''Header''''' Label and the '''''Footer''''' Label. | # ...the extractor will ''only'' return text between the '''''Header''''' Label and the '''''Footer''''' Label. | ||
#* In our case, only text between <code>Settlement Agent</code> and <code>Seller</code>. | #* In our case, only text between <code>Settlement Agent</code> and <code>Seller</code>. | ||
| Line 1,342: | Line 1,342: | ||
'''⚠''' | '''⚠''' | ||
| | | | ||
This is very specific functionality to the ''Labeled Value'' extractor and its interaction with Label Sets. It will only behave this way if you: | This is very specific functionality to the '''''Labeled Value''''' extractor and its interaction with Label Sets. It will only behave this way if you: | ||
# Are using the ''Labeling Behavior'' and the '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value''. | # Are using the '''''Labeling Behavior''''' and the '''Data Field's''' '''''Value Extractor''''' is set to ''Labeled Value''. | ||
# Have collected ''both'' a '''''Header''''' Label and a '''''Footer''''' Label for the '''Data Field'''. | # Have collected ''both'' a '''''Header''''' Label and a '''''Footer''''' Label for the '''Data Field'''. | ||
# Have ''not'' configured the ''Labeled Value'' extractor's '''''Label Extractor''''' or '''''Value Extractor'''''. | # Have ''not'' configured the '''''Labeled Value''''' extractor's '''''Label Extractor''''' or '''''Value Extractor'''''. | ||
|} | |} | ||
| Line 1,395: | Line 1,395: | ||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
Now that the '''''Static''''' label is collected, how does '''Grooper''' know to return it during extraction when the '''Extract''' activity runs? The short answer is the ''Labeled Value'' extractor type will do this for us. | Now that the '''''Static''''' label is collected, how does '''Grooper''' know to return it during extraction when the '''Extract''' activity runs? The short answer is the '''''Labeled Value''''' extractor type will do this for us. | ||
With "Factura Technology Corp" collected as a '''''Static''''' label, and the "Vendor Name" '''Data Field''' configured to utilize the ''Labeled Value'' extractor, it will return the '''''Static''''' label itself as the result. | With "Factura Technology Corp" collected as a '''''Static''''' label, and the "Vendor Name" '''Data Field''' configured to utilize the '''''Labeled Value''''' extractor, it will return the '''''Static''''' label itself as the result. | ||
# Here, we have the "Vendor Name" '''Data Field''' selected in the Node Tree. | # Here, we have the "Vendor Name" '''Data Field''' selected in the Node Tree. | ||
# The '''Data Field's''' '''''Value Extractor''''' property is set to use the ''Labeled Value'' extractor type. | # The '''Data Field's''' '''''Value Extractor''''' property is set to use the '''''Labeled Value''''' extractor type. | ||
# The ''Labeled Value'' extractor's '''''Label Extractor''''' and '''''Value Extractor''''' are both unconfigured. | # The '''''Labeled Value''''' extractor's '''''Label Extractor''''' and '''''Value Extractor''''' are both unconfigured. | ||
# With this ''Labeled Value'' configuration, and a '''''Static''''' label collected for this '''Data Field''', the '''''Static''''' label is itself what the extractor is looking for on the document. | # With this '''''Labeled Value''''' configuration, and a '''''Static''''' label collected for this '''Data Field''', the '''''Static''''' label is itself what the extractor is looking for on the document. | ||
# If present, it will be returned and collected at time of extraction when the '''Extract''' activity runs. | # If present, it will be returned and collected at time of extraction when the '''Extract''' activity runs. | ||
| | | | ||
| Line 1,418: | Line 1,418: | ||
=== About Label Match === | === About Label Match === | ||
The ''Label Match'' extractor is ''extremely'' similar to the ''List Match'' extractor in that it matches one or more items in a defined list. However, it is designed specifically to work with the ''Labeling Behavior'' functionality. It will use the fuzzy extraction and vertical and constrained wrapping settings defined on the '''Content Model''' if a ''Labeling Behavior'' is enabled. This way, you can have a single, unified set of fuzzy match settings for multiple extractors. Rather than configuring these settings, including the confidence score threshold and fuzzy weighting, for multiple extractors, you can configure them just once when enabling the ''Labeling Behavior'' and all ''Label Match'' extractors will use them. | The ''Label Match'' extractor is ''extremely'' similar to the ''List Match'' extractor in that it matches one or more items in a defined list. However, it is designed specifically to work with the '''''Labeling Behavior''''' functionality. It will use the fuzzy extraction and vertical and constrained wrapping settings defined on the '''Content Model''' if a '''''Labeling Behavior''''' is enabled. This way, you can have a single, unified set of fuzzy match settings for multiple extractors. Rather than configuring these settings, including the confidence score threshold and fuzzy weighting, for multiple extractors, you can configure them just once when enabling the '''''Labeling Behavior''''' and all ''Label Match'' extractors will use them. | ||
* For more information on fuzzy extraction, visit the [[Fuzzy RegEx]] article. | * For more information on fuzzy extraction, visit the [[Fuzzy RegEx]] article. | ||
| Line 1,425: | Line 1,425: | ||
# The document folder must be classified. | # The document folder must be classified. | ||
#* In other words, it must have a '''Document Type''' assigned to it. | #* In other words, it must have a '''Document Type''' assigned to it. | ||
# That '''Document Type''' must have a ''Labeling Behavior'' enabled. | # That '''Document Type''' must have a '''''Labeling Behavior''''' enabled. | ||
#* Either on the '''Document Type''' or, more typically, its parent '''Content Model'''. | #* Either on the '''Document Type''' or, more typically, its parent '''Content Model'''. | ||
</tab> | </tab> | ||
| Line 1,442: | Line 1,442: | ||
#* <code>$|[^\w]</code> is the default '''''Suffix Pattern'''''. | #* <code>$|[^\w]</code> is the default '''''Suffix Pattern'''''. | ||
# The document we have selected is classified as an "Invoice" '''Document Type'''. | # The document we have selected is classified as an "Invoice" '''Document Type'''. | ||
# This is a '''Document Type''' in the '''Content Model''' with the ''Labeling Behavior'' enabled. | # This is a '''Document Type''' in the '''Content Model''' with the '''''Labeling Behavior''''' enabled. | ||
# Upon execution, notice some results are returned with a confidence ''below'' 100%. | # Upon execution, notice some results are returned with a confidence ''below'' 100%. | ||
#* This is due to the fuzzy matching settings configured from the ''Labeling Behavior''. The '''''Label Similarity''''' property was set to ''90%''. Any items in the list with a fuzzy matching similarity score above 90% are returned. Any falling below 90% (for example the list item <code>CALLER:</code>) are not. | #* This is due to the fuzzy matching settings configured from the '''''Labeling Behavior'''''. The '''''Label Similarity''''' property was set to ''90%''. Any items in the list with a fuzzy matching similarity score above 90% are returned. Any falling below 90% (for example the list item <code>CALLER:</code>) are not. | ||
#* Note this means changing the ''Labeling Behavior'' settings will impact ALL ''Label Match'' extractors for the '''Content Model's''' '''Document Types'''. | #* Note this means changing the '''''Labeling Behavior''''' settings will impact ALL ''Label Match'' extractors for the '''Content Model's''' '''Document Types'''. | ||
| | | | ||
[[File:Value-reader-extractor-types-label match-02-v2.png]] | [[File:Value-reader-extractor-types-label match-02-v2.png]] | ||
|} | |} | ||
Where are these ''Labeling Behavior'' settings again? | Where are these '''''Labeling Behavior''''' settings again? | ||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
# The '''Content Model''' selected here has enabled a ''Labeling Behavior''. | # The '''Content Model''' selected here has enabled a '''''Labeling Behavior'''''. | ||
# ''Labeling Behavior'' is enabled using the '''''Behaviors''''' property... | # '''''Labeling Behavior''''' is enabled using the '''''Behaviors''''' property... | ||
# ...and added using the collection editor seen here, as discussed earlier in this article. | # ...and added using the collection editor seen here, as discussed earlier in this article. | ||
# The ''Label Match'' extractor will use all the fuzzy extraction and text wrapping settings defined here. | # The ''Label Match'' extractor will use all the fuzzy extraction and text wrapping settings defined here. | ||
| Line 1,478: | Line 1,478: | ||
This is also the basic idea behind the ''Tabular Layout'' '''''Extraction Method'''''. It too utilizes column header labels to "read" tables on documents, or at least as step number one in modeling the table's structure so that '''Grooper''' can extract data from each cell in the table. | This is also the basic idea behind the ''Tabular Layout'' '''''Extraction Method'''''. It too utilizes column header labels to "read" tables on documents, or at least as step number one in modeling the table's structure so that '''Grooper''' can extract data from each cell in the table. | ||
Furthermore, using the ''Tabular Layout'' method, collected label sets using a ''Labeling Behavior'' can also be used to extract data from tables on documents. In this case, the labels collected for the '''Data Column''' children of a '''Data Table''' are utilized to help model the table's structure. | Furthermore, using the ''Tabular Layout'' method, collected label sets using a '''''Labeling Behavior''''' can also be used to extract data from tables on documents. In this case, the labels collected for the '''Data Column''' children of a '''Data Table''' are utilized to help model the table's structure. | ||
Once the column header locations are established, the next requirement is a way to understand how many rows are in the table. This is done by configuring at least one '''Data Column's''' '''''Value Extractor''''' property. Generally, there is at least one column in a table that is always present for every row in the table. If you can use an extractor to locate that data below its corresponding column header, that gives you a way of finding each row in the table. | Once the column header locations are established, the next requirement is a way to understand how many rows are in the table. This is done by configuring at least one '''Data Column's''' '''''Value Extractor''''' property. Generally, there is at least one column in a table that is always present for every row in the table. If you can use an extractor to locate that data below its corresponding column header, that gives you a way of finding each row in the table. | ||
| Line 1,550: | Line 1,550: | ||
#* The label should read "Description" but OCR made some missteps and recognized that segment as "DescripUon". | #* The label should read "Description" but OCR made some missteps and recognized that segment as "DescripUon". | ||
#* The "ti" in "Description" were recognized as a capital "U". This means "Description" is two characters different from "Description" or roughly 82% similar. The ''Labeling Behavior's'' similarity threshold is set to ''90%'' for this '''Content Model'''. 81% is less than 90%. So, the result is thrown out. | #* The "ti" in "Description" were recognized as a capital "U". This means "Description" is two characters different from "Description" or roughly 82% similar. The ''Labeling Behavior's'' similarity threshold is set to ''90%'' for this '''Content Model'''. 81% is less than 90%. So, the result is thrown out. | ||
#** FYI, this threshold is configured when the ''Labeling Behavior'' is added using the '''''Behaviors''''' property of a '''Content Model'''. The '''''Label Similarity''''' property is set to ''90%'' by default, but can be adjusted at any time by configuring the ''Labeling Behavior'' item in the '''''Behaviors''''' list. | #** FYI, this threshold is configured when the '''''Labeling Behavior''''' is added using the '''''Behaviors''''' property of a '''Content Model'''. The '''''Label Similarity''''' property is set to ''90%'' by default, but can be adjusted at any time by configuring the '''''Labeling Behavior''''' item in the '''''Behaviors''''' list. | ||
As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the ''Label Behavior's'' fuzzy match settings. | As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the ''Label Behavior's'' fuzzy match settings. | ||
| Line 1,665: | Line 1,665: | ||
# For the selected document folder in the "Batch Viewer" window... | # For the selected document folder in the "Batch Viewer" window... | ||
# Press the "Test Extraction" button. | # Press the "Test Extraction" button. | ||
#* Side note: We've seen before we can test extraction using the "Labels" tab of a '''Content Model''' or '''Document Type''' when ''Labeling Behavior'' is enabled. The only real difference is we're testing extraction for the specific '''Data Element''' selected in the Node Tree. In this case the "Line Items" '''Data Model'''. The "Test" button in the "Labels" tab will test extraction for the ''entire'' '''Data Model''' and all its component child '''Data Elements'''. However, feel free to test extraction at either location. The end result is the same. We're testing to verify extraction results. | #* Side note: We've seen before we can test extraction using the "Labels" tab of a '''Content Model''' or '''Document Type''' when '''''Labeling Behavior''''' is enabled. The only real difference is we're testing extraction for the specific '''Data Element''' selected in the Node Tree. In this case the "Line Items" '''Data Model'''. The "Test" button in the "Labels" tab will test extraction for the ''entire'' '''Data Model''' and all its component child '''Data Elements'''. However, feel free to test extraction at either location. The end result is the same. We're testing to verify extraction results. | ||
# The results show up in the "Data Element Preview" window. | # The results show up in the "Data Element Preview" window. | ||
| Line 1,764: | Line 1,764: | ||
To do this, we will need to update the "Factura" '''Document Type's''' Label Set. | To do this, we will need to update the "Factura" '''Document Type's''' Label Set. | ||
# Navigate to the '''Content Model''' with ''Labeling Behavior'' enabled in the Node Tree. | # Navigate to the '''Content Model''' with '''''Labeling Behavior''''' enabled in the Node Tree. | ||
# Switch to the "Labels" tab. | # Switch to the "Labels" tab. | ||
# In the "Batch Selector", select the '''Document Type''' whose Label Set you wish to edit. | # In the "Batch Selector", select the '''Document Type''' whose Label Set you wish to edit. | ||
| Line 2,325: | Line 2,325: | ||
|style="font-size:14pt"|'''FYI''' | |style="font-size:14pt"|'''FYI''' | ||
| | | | ||
What does this have to do with ''Labeling Behavior'' and Label Sets? | What does this have to do with '''''Labeling Behavior''''' and Label Sets? | ||
We're getting there. Ultimately, ''Transaction Detection'' is "Label Set aware" and can take advantage of collected '''''Header''''' and '''''Footer''''' labels for a '''Data Section''' object. However, collecting labels for the '''Data Section''' will quite dramatically change how ''Transaction Detection'' works. | We're getting there. Ultimately, ''Transaction Detection'' is "Label Set aware" and can take advantage of collected '''''Header''''' and '''''Footer''''' labels for a '''Data Section''' object. However, collecting labels for the '''Data Section''' will quite dramatically change how ''Transaction Detection'' works. | ||
| Line 2,856: | Line 2,856: | ||
=== Data Element Override Utility === | === Data Element Override Utility === | ||
Earlier in this article, we talked about using the ''Labeled Value | Earlier in this article, we talked about using the '''''Labeled Value''''' Extractor Type ''without'' configuring its '''''Value Extractor'''''. Again, it is considered best practice to configure its '''''Value Extractor'''''. However, sometimes data is difficult to pattern match. For example, crafting an extractor to return people or company names can be difficult to craft. It is truly these cases why the option to leave a '''''Labeled Value''''' extractor's '''''Value Extractor''''' unconfigured is an option with Label Sets. | ||
To make the best use of this functionality, Data Element Overrides are typically necessary. Indeed, because the Label Set approach is more templated in nature, Data Element Overrides can be a useful tool to fine tune extraction for one specific '''Document Type'''. In this section, we will use the "Purchase Order Number" '''Data Field''' of our "Labeling Behavior - Invoices - Model" '''Content Model''' to demonstrate this. | To make the best use of this functionality, Data Element Overrides are typically necessary. Indeed, because the Label Set approach is more templated in nature, Data Element Overrides can be a useful tool to fine tune extraction for one specific '''Document Type'''. In this section, we will use the "Purchase Order Number" '''Data Field''' of our "Labeling Behavior - Invoices - Model" '''Content Model''' to demonstrate this. | ||
| Line 3,040: | Line 3,040: | ||
=== 2021 === | === 2021 === | ||
The ''Labeling Behavior'' is brand new functionality in '''Grooper''' version '''2021'''. Prior to this version, its functionality may have been able to be approximated by other objects and their properties (For example, a '''Data Type''' using the ''Key-Value Pair'' collation is at least in some ways similar to how the ''Labeled Value | The '''''Labeling Behavior''''' is brand new functionality in '''Grooper''' version '''2021'''. Prior to this version, its functionality may have been able to be approximated by other objects and their properties (For example, a '''Data Type''' using the ''Key-Value Pair'' collation is at least in some ways similar to how the '''''Labeled Value''''' Extractor Type works). However, creation of label sets using '''Document Types''' and their implementation described above was not available prior to version '''2021'''. | ||
Revision as of 08:49, 13 May 2022
The Labeling Behavior is a Content Type Behavior designed to collect and utilize a document's field labels in a variety of ways. This includes functionality for classification and data extraction.
The Labeling Behavior functionality allows Grooper users to quickly onboard new Document Types for structured and semi-structured forms, utilizing labels as a thumbprint for classification and data extraction purposes. Once the Labeling Behavior is enabled, labels are identified and collected using the "Labels" tab of Document Types. These "Label Sets" can then be used for the following purposes:
- Document classification - Using the Labelset-Based Classification Method
- Field based data extraction - Primarily using the Labeled Value Extractor Type
- Tabular data extraction - Primarily using a Data Table object's Tabular Layout Extract Method
- Sectional data extraction - Primarily using a Data Section object's Transaction Detection Extract Method
| FYI | The Labeling Behavior and its functionality discussed in this article are often referred to as "Label Set Behavior" or simply "Label Sets". |
About
|
You may download and import the file below into your own Grooper environment (version 2021). This contains the Batch(es) with the example document(s) discussed in this article and the Content Model(s) configured according to the How To section's instructions. |

Labels serve an important function on documents. They give the reader critical context to understand where data is located and what it means. How do you know the difference between the date on an invoice document indicating when the invoice was sent and the date indicating when you should pay the invoice? It's the labels. The labels are what distinguish one type of date from another. For example, "Invoice Date" labels the date the invoice was sent, and "Due Date" labels the date you need to pay by.
Labels can be a way of classifying documents as well. What does one individual label tell you about a document? Well, maybe not much. However, if you take them all together, they can tell you quite a bit about the kind of document you're looking at. For example, a W-4 employee withholding form is going to use different labels than an employee healthcare enrollment form. These are two very different documents collecting very different information. The labels used to collect this information are thus different as well.
Furthermore, you can even tell the difference between two very closely related documents using labels as well. For example, two different invoices from two different vendors may share some similarity in the labels they use to detail information. But there will be some differences as well. These differences can be useful identifiers to distinguish one from the other. Put all together, labels can act as a thumbprint Grooper can use to classify a document as one Document Type or another.
The Labeling Behavior is built on these concepts, collecting and utilizing labels for Document Types in a Content Model for classification and data extraction purposes.
|
As a Behavior, the Labeling Behavior is enabled on a Content Type object in Grooper.
Once the Labeling Behavior is enabled, the next big step is collecting label sets for the various Document Types in your Content Model.
Each Document Type has its own set of labels used to define information on the document. For example, the "Factura" Document Type in this Content Model uses the label "PO Number" to call out the purchase order number on this invoice document. A different Document Type, corresponding to a different invoice format, might use a different label such as "Purchase Order Number" or "PO #".
For more information on collecting label sets for the Document Types in your Content Model see the How To section of this article. |
|||
|
Once label sets are collected for each Document Type, they can be used for classification and data extraction purposes. For example, labels were used in this case to:
For more information on how to use labels for these purposes, see the How To section of this article. |
How To
The Labeling Behavior (often referred to as "Label Set Behavior" or just "Label Sets") is well suited for structured and semi-structured document sets. Label Sets are particularly useful for situations where you have multiple variations for one kind of document or another. While the information you want to extract from the document set may be the same from variation to variation, how the data is laid out and labeled may be very different from one variation of the document to another. Label Sets allow you to quickly onboard new Document Types to capture new form structures.
|
We will use invoices for the document set in the following tutorials. In a perfect world, you'd create a Content Model with a single "Invoice" Document Type whose Data Model would successfully extract all Data Elements for all invoices from all vendors every time no matter what. This is often not the case. You may find you need to add multiple Document Types to account for variations of an invoice from multiple vendors. Label Sets give you a method of quickly adding Document Types to model new variations. In our case, we will presume we need to create one Document Type for each vendor. We will start with five Document Types for invoices from five vendors.
|
Collect Label Sets
|
Collecting labels for the Document Types in your Content Model will be the first thing you want to do after enabling the Labeling Behavior. Labels for each Data Element in the Document Type's Data Model are defined using the "Labels" tab of the Content Model.
|
|||
|
Collect Field Labels
Now that this document has been classified (assigned a Document Type from our Content Model), we can collect labels for its Document Type. This can be done in one of three ways:
- Lassoing the label in the "Document Viewer".
- Double-clicking the label in the Document Viewer.
- Typing the label in manually.
| ‼ | Going forward, this tutorial presumes you have obtained machine readable text from these documents, either OCR'd text or native text, via the Recognize activity. |
|
Generally the quickest way is by simply lassoing the label in the "Document Viewer".
If you choose, you may also manually enter a label for a Data Element by simply typing it into the text box.
Collect Table and Column Labels
|
Table and column labels can be used for tabular data extraction as well, by setting a Data Table object to use the Tabular Layout Extract Method. When collecting labels for this method of table extraction, keep in mind you must collect the individual column headers, and you may optionally collect the full row of column header labels as well. While it is optional, it is generally regarded as best practice to capture the full row of column header labels. This will generally increase the accuracy of your column label extraction. We will do both in this tutorial.
This may seem like you are duplicating your efforts but it is often critical to do both in order for the Tabular Layout Extract Method to map the table's structure and ultimately collect the table's data.
Auto Map Labels
|
As you add labels for each Document Type, you may find some documents have labels in common. For example, there are only so many ways to label an invoice number. It might be "Invoice Number", "Invoice No", "Invoice #" or even just "Invoice". Some invoices are going to use one label, others another. When collecting labels for multiple Document Types you can use the "Auto Map" feature to automatically add labels you've previously collected on another Document Type.
Grooper will search the document's text for labels matching those previously collected on other Document Types.
If a match is not found, the Data Element's label is left blank.
As you keep collecting labels for more and more Document Types, the Auto Map feature will pick up more and more labels, allowing you to quickly onboard new Document Types. |
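The Auto Map idea can be sketched in a few lines of Python. This is only an illustration of the concept described above, not Grooper's implementation: the function name `auto_map`, the dictionary layout, the plain case-insensitive substring search, and the choice to prefer the longest matching label are all assumptions made for the example.

```python
def auto_map(document_text, collected_label_sets, data_elements):
    """Suggest a label per Data Element from labels collected on other Document Types.

    collected_label_sets maps a Document Type name to {Data Element name: label}.
    This layout is hypothetical and exists only for this sketch.
    """
    text = document_text.lower()
    mapped = {}
    for element in data_elements:
        mapped[element] = ""  # left blank when no previously collected label matches
        # Every label any other Document Type has used for this Data Element.
        candidates = {
            labels[element]
            for labels in collected_label_sets.values()
            if labels.get(element)
        }
        # Assumption: try longer labels first, so "Invoice Number" wins over "Invoice".
        for label in sorted(candidates, key=lambda s: (-len(s), s)):
            if label.lower() in text:
                mapped[element] = label
                break
    return mapped


# Hypothetical label sets and document text.
collected = {
    "Factura": {"Invoice Number": "Invoice No", "PO Number": "PO Number"},
    "Envoy": {"Invoice Number": "Invoice #", "PO Number": "Purchase Order Number"},
}
new_document = "ACME Supply  Invoice # 10442  Date 05/13/2022"
suggestions = auto_map(new_document, collected, ["Invoice Number", "PO Number"])
```

Here "Invoice Number" is mapped to "Invoice #" because that label appears in the new document, while "PO Number" stays blank because none of the previously collected labels were found.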
|
|
Be aware, you may still need to validate the auto mapped values and make adjustments.
|
Collect Custom Labels
It's important to keep in mind labels are collected for corresponding Data Elements in a Data Model. You collect one label per Data Element (Data Field, Data Section, Data Table or Data Column). What if you want to collect a label that is distinct from a Data Element, one that doesn't necessarily have to do with a value collected by your Data Model? And why would you even want to?
That's what "Custom Labels" are for. Custom labels serve two primary functions:
- Providing additional labels for classification purposes.
- Providing context labels when a Data Element's label matches multiple points on a document.
|
Custom Labels may only be added to Data Model, Data Section or Data Table objects' labels. Put another way, any Data Element in the Data Model's hierarchy that can have child Data Elements can have custom labels. When used for classification purposes, custom labels are typically added to the Data Model itself.
You may add more Custom Labels to the selected Data Element by repeating the process described above.
|
Custom Labels as Context Labels
|
Some labels are more specific than others. The label "Invoice Date" is more specific than the label "Date". If you see the label "Invoice Date" you know the date you're looking at is the date the invoice was generated. The label "Date" may refer to the invoice's generation date or it could be part of another label like "Due Date". However, some invoice formats will label the invoice date as simply "Date".
This can present a challenge for data extraction. The possibilities for false-positive results tend to crop up the more generic the label used to identify a desired value. There are three separate date values identified by the word "Date" (in full or in part) on this document. |
This is the second reason Custom Labels are typically added for a Document Type, to provide extra context for generic labels, especially when they produce multiple results on a document, leading to false-positive data extraction.
There are two steps to adding and using a Custom Label for this purpose:
- Add the Custom Label.
- Marry the Custom Label with the Data Element's label.
We will refer to this type of a Custom Label as a "Context Label" from here out.
|
The only "trick" to this is adding the Context Label to the appropriate level of the Data Model's hierarchy. Remember, a Custom Label may only be added to a Data Model, Data Section or Data Table object. We cannot add a Custom Label to a Data Field, such as the "Invoice Number" Data Field. To add a Context Label a Data Field can use, we must add the Custom Label to its direct parent Data Element.
Now that we've added the label, we need to marry the Custom Label with the Data Field it's giving extra context to. This is done with the Parent property of a Data Field label.
Use Label Sets for Classification
About Labelset-Based Classification
Label Sets can be used for classifying documents using the Labelset-Based Classification Method. For structured and semi-structured forms, labels end up being a way of identifying a document. Without the field data entered, the labels are really what define the document. You know what kind of document you're looking at based on what kind of information is presented and, in the case of Labelset-Based classification, how that data is labeled. Even when those labels are very similar from one variant to the next, they end up being a thumbprint of that variant. For example, you might use Labelset-Based classification to create Document Types for different variations of invoices from different vendors. The information presented on each variant from each vendor will be more or less the same, and some labels will be more commonly used by different vendors (such as "Invoice Number"). However, if there is enough variation in the set of labels, you can easily differentiate an invoice from one vendor versus another just based on the variation in labels.
|
Take these four "documents". Each one is collecting the same information:
So we might have five Data Fields in our Data Model, one for each piece of information. We'd also collect one label for each Data Field. While the data we want from these documents is the same, there is some variation in the labels used for each different document type. If we wanted to distinguish these four documents from each other, we could classify them using the Labelset-Based Classification Method. This is all done by measuring the similarity between the collected label sets for each Document Type.
How is Document Type "C" different from Document Type "A"?
How is Document Type "D" different from Document Type "A"?
|
|
Using the Labelset-Based Classification Method, unclassified documents are classified by assigning the document the Document Type whose labels are most similar. The basic concept is that "similarity" is determined by how many labels are shared between the unclassified document and the label sets collected for the Document Types in your Content Model. The unclassified document is assigned the Document Type with the highest degree of similarity between matched labels and the Document Types' label sets.
|
The similarity calculation is very straightforward. Grooper searches for labels collected for every Document Type and measures the total character difference between all the labels matched on the document. If each of these five labels is collected for each Document Type's Label Set, you'd have the following character totals for the set.
How similar is Document Type "A" to Document Type "B"?
How similar is Document Type "A" to Document Type "C"?
How similar is Document Type "A" to Document Type "D"?
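The comparison described above can be approximated with a short Python sketch. It is a simplified stand-in rather than Grooper's actual calculation: it assumes similarity is the share of a Label Set's characters that belong to labels found on the document, uses exact case-insensitive matching instead of fuzzy matching, and the Document Type names and labels are made up for the example.

```python
def label_set_similarity(document_text, label_set):
    """Fraction of the Label Set's characters whose labels appear in the document."""
    text = document_text.lower()
    total_chars = sum(len(label) for label in label_set)
    if total_chars == 0:
        return 0.0
    matched_chars = sum(len(label) for label in label_set if label.lower() in text)
    return matched_chars / total_chars


def classify(document_text, label_sets):
    """Return the best-scoring Document Type name along with every score."""
    scores = {
        name: label_set_similarity(document_text, labels)
        for name, labels in label_sets.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical Label Sets for two Document Types.
label_sets = {
    "A": ["Invoice Number", "Invoice Date", "Total"],
    "B": ["Invoice No", "Date", "Amount Due"],
}
document = "Invoice No 7781  Date 05/13/2022  Amount Due $120.00"
best_type, scores = classify(document, label_sets)
```

With this sample text, every label collected for "B" is present and none of the labels collected for "A" are, so the document would be assigned Document Type "B".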
If we run one of these "documents" through Grooper, we can see these results very clearly.
|
Configuring Labelset-Based Classification
Next, we will walk through the steps required to enable and configure the Labelset-Based Classification Method, using our example set of invoice documents.
The basic steps are as follows:
- Set the Content Model's Classification Method property to Labelset-Based
- Collect labels for each Document Type
- Test classification
- Reconfigure, updating existing Document Types' Label Sets and adding new Document Types as needed.
Assign the Labelset-Based Classification Method
|
Once you've figured out you want to use Label Sets to classify your documents, you need to tell your Content Model that's what you want to do! This is done by setting the Content Model's Classification Method property to Labelset-Based.
Next, we will collect labels for each Document Type in the Content Model.
|
Collect Labels
|
See the above how to (Collect Label Sets) for a full explanation of how to collect labels for Document Types in a Content Model. The rest of this tutorial will presume you have general familiarity with collecting labels.
Test Classification
In general, regardless of the Classification Method used, one of three things is going to happen to Batch Folders in a Batch during classification.
- The folder will be assigned the correct Document Type.
- The folder will be assigned the wrong Document Type.
- The folder will be assigned no Document Type at all.
The Labelset-Based method is no different. If all folders are classified correctly, that's great. However, testing is all about ensuring this is the case and figuring out where and why problems arise when folders are classified wrong or not classified at all.
We will look at a couple examples of how classification can go wrong using the Labelset-Based method, why that is the case, and what to do about it.
| FYI |
The example Batch in the rest of this tutorial is purposefully small to illustrate a few key points. In the real world, you will want to test using a much larger batch with several examples of each Document Type. |
Now we just need to evaluate the success or failure of our classification. Let's look at a few documents in our Batch before detailing what we will do to resolve any classification errors.
What can we do about this? Sometimes you have to know when to stop. Will it be worth it to reconfigure your Content Model and Label Sets to force Grooper to classify this document in one way or another? Probably not. This is more likely than not an extreme outlier, not representative of the larger document set. It may be easier to kick this document (and other outliers) out to human review, especially if reconfiguring the Content Model is going to negatively impact results in other ways. You have to know when to leave well enough alone. Outliers like this are a good example of when to do just that. |
Common Problems and Solutions
Custom Labels to Boost Similarity
Just because we don't have a Data Field for it doesn't mean it's not a useful label for classification. Even though we don't need to extract the salesperson's identification number, the fact that label "Salesperson ID" is present on these invoices could be important. It's another feature that makes up the "Envoy" Document Type. We just need a way of telling Grooper to use this label for classification, even though we can ignore it when it comes time to extract data from these documents. That is one of the reasons for adding custom labels to a Document Type's Label Set. |
Now that this label is in the Label Set, it will be considered a label during classification. The label's there. It's part of the document, whether we're extracting the value or not. We "tell" Grooper labels like these should be considered features for classification by creating custom labels.
When we re-classify this Batch, we will see some different results.
|
Adding New Document Types
|
The Labelset-Based classification method makes some assumptions about your document processing approach. It shines with structured and semi-structured forms. Labels, more or less, "stay put" on these kinds of documents. You'll see the same field labels over and over again even though the field values will change from document to document. This presumes your Document Types will be very regular (or rigid, with one Label Set very specifically corresponding to one Document Type). If you encounter a new form or variant of an existing form, you likely will need to account for it with a new Document Type.
Luckily, the process of adding new Document Types and defining their label sets is quick and painless and actually can become easier the more Document Types you add to the Content Model. |
|
|
You can do the whole thing in the "Labels" tab of the Content Model.
|
|
That's it! You've added a new Document Type and collected its Label Set.
As you keep adding more and more Document Types to the Content Model, you will inevitably keep adding more and more labels for the Data Elements in your Data Model. Eventually, you will come across a new document variant that shares a lot of similarity with an already existing Document Type.
This is where the label auto-map functionality comes in handy.
Grooper will search for matching labels already collected in the Label Sets of other Document Types.
Volatile Labels
|
Sometimes, you will collect a label you do not want to use for classification purposes. Most often, this is because the label may or may not be present depending on the document. For example, some of these invoices from Standard Products have the sales tax totaled on the document. However, some do not. This is called a "Volatile" label. Its presence on a document is unpredictable. Sometimes it's there. Sometimes it's not. It's an optional piece of information. However, because it's optional (or "volatile") we don't actually want to include this as a label for classification. It's going to decrease the similarity score for documents that do not contain the label. |
You can indicate these kinds of labels are "volatile" and should not be considered for classification. Whether it's there or not, Grooper will not include it as a feature to measure the similarity between an unclassified document and the Document Type.
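The effect of marking a label as volatile can be illustrated with a small Python sketch. The scoring rule here (matched label characters divided by total label characters) is an assumption made for illustration, not Grooper's documented formula, and the labels and document text are hypothetical.

```python
def similarity(document_text, labels, volatile=frozenset()):
    """Score a document against a Label Set, skipping labels flagged as volatile."""
    text = document_text.lower()
    # Volatile labels are left out of both the numerator and the denominator.
    scored = [label for label in labels if label not in volatile]
    total_chars = sum(len(label) for label in scored)
    if total_chars == 0:
        return 0.0
    matched_chars = sum(len(label) for label in scored if label.lower() in text)
    return matched_chars / total_chars


labels = ["Invoice Number", "Invoice Date", "Sales Tax"]
# An invoice that happens not to list sales tax.
document = "Invoice Number 5521  Invoice Date 05/13/2022"

score_counting_tax = similarity(document, labels)
score_tax_volatile = similarity(document, labels, volatile={"Sales Tax"})
```

When "Sales Tax" counts toward the score, its absence pulls the similarity down to 26/35 (about 74%). Once it is treated as volatile, the same document scores 100% against the Label Set.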
Use Label Sets for Field Based Extraction
Label Sets and the Labeled Value Extractor Type
Intro to The Labeled Value Extractor
For most static field based extraction, the Labeling Behavior leverages the Labeled Value Extractor Type. Let's first briefly examine how Labeled Value works outside of the Labeling Behavior functionality.
As the name implies, the Labeled Value extractor is designed to return labeled values. A common feature of structured forms is to divide information across a series of fields. But it's not as if you just have a bunch of data randomly strewn throughout the document. Typically, the field's value will be identified by some kind of label. These labels provide the critical context to what the data refers to.
Labeled Value relies on the spatial relationship between the label and the value. Most often labels and their corresponding values are aligned in one of two ways.
|
1. The value will be to the right of the label. |
|
|
2. The value will be below the label. |
Labeled Value uses two extractors itself, one to find the label and another for the value. If the two extractors' results are aligned horizontally or vertically within a certain amount of space (according to how the Labeled Value extractor is configured), the value's result is returned.
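The spatial check can be pictured with a minimal Python sketch. It is not how Grooper is implemented: the `Box` class, the inch-based coordinates, and the `max_right` and `max_below` thresholds are hypothetical stand-ins for the extractor's layout settings.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A piece of text and its bounding box on the page (units assumed to be inches)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def overlaps(a_start, a_end, b_start, b_end):
    """True when two one-dimensional ranges share some length."""
    return min(a_end, b_end) - max(a_start, b_start) > 0


def find_labeled_value(label, candidates, max_right=3.0, max_below=0.5):
    """Return the nearest candidate that sits to the right of or below the label."""
    best = None
    for value in candidates:
        gap_right = value.left - label.right
        gap_below = value.top - label.bottom
        same_row = overlaps(label.top, label.bottom, value.top, value.bottom)
        same_column = overlaps(label.left, label.right, value.left, value.right)
        if same_row and 0 <= gap_right <= max_right:
            distance = gap_right   # layout 1: value to the right of the label
        elif same_column and 0 <= gap_below <= max_below:
            distance = gap_below   # layout 2: value below the label
        else:
            continue
        if best is None or distance < best[0]:
            best = (distance, value)
    return best[1] if best else None


date_label = Box("Invoice Date", 1.0, 1.0, 2.0, 1.2)
po_label = Box("PO Number", 1.0, 2.0, 1.9, 2.2)
values = [
    Box("05/13/2022", 2.3, 1.0, 3.1, 1.2),  # right of "Invoice Date"
    Box("PO-4471", 1.0, 2.3, 1.7, 2.5),     # below "PO Number"
    Box("06/12/2022", 5.5, 4.0, 6.3, 4.2),  # elsewhere on the page
]
```

Calling `find_labeled_value(date_label, values)` picks the date on the same row, while `find_labeled_value(po_label, values)` picks the value directly beneath its label. A value that is not aligned with the label in either direction is ignored.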
|
However, the Labeled Value extractor's set up is a little different when combining it with the Labeling Behavior. The end result is a simpler configuration, utilizing collected labels for the Label Extractor.
Label Sets and Labeled Value
|
Since this Content Model utilizes the Labeling Behavior, at least part of the setup described in the previous section was unnecessary. If you've collected a label for the Data Field and that Data Field's Value Extractor is set to Labeled Value, there is no need to configure a Label Extractor. Instead, Grooper will pass through the collected label to the Labeled Value extractor.
| ⚠ |
While you can get a result without configuring the Labeled Value extractor's Value Extractor, that doesn't mean you should. It is considered best practice to always configure the Value Extractor. |
Best Practice Considerations
While you can get a result without configuring the Labeled Value extractor's Value Extractor, that doesn't mean you should. It is considered best practice to always configure the Value Extractor.
So, why is it considered best practice to do so? The short answer is to increase the accuracy of your data extraction. A simple segment could be anything. If you know the data you're trying to extract has a certain pattern to it, you should target that data according to its pattern. Dates, for example, follow a few different patterns. Maybe it's "07/20/1969" or "07-20-69" or "July 20, 1969", but you know it's a date because it has a specific syntax or pattern to it. To increase the accuracy of your extraction, you should configure the Value Extractor with an extractor that returns the kind of data you're attempting to return.
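As a rough illustration of targeting data by its pattern, the three date formats mentioned above can be matched with a single regular expression. This sketch uses Python's `re` module rather than a Grooper extractor, and the pattern is deliberately minimal (it does not validate month or day ranges).

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Numeric dates such as 07/20/1969 or 07-20-69, or long dates such as July 20, 1969.
DATE_PATTERN = re.compile(
    r"\b(?:\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})"
    r"|(?:" + MONTHS + r") \d{1,2}, \d{4})\b"
)


def find_dates(text):
    """Return every substring of text that looks like one of the date formats."""
    return DATE_PATTERN.findall(text)


sample = "Sent 07/20/1969, received 07-20-69, filed July 20, 1969."
dates = find_dates(sample)
```

Here `dates` holds the three date strings from the sample, while a segment such as "PO Number 12345" yields no match at all. That is the point of configuring a value pattern: text that merely sits next to a label is not returned unless it actually looks like the expected data.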
|
We can see fairly quickly why leaving the Labeled Value extractor's Value Extractor unconfigured is not ideal.
Configuring the Labeled Value extractor's Value Extractor also gives you the myriad of functionalities available to extractors. For example, Fuzzy RegEx is one of the main ways Grooper gets around poor OCR data at the time of extraction. When the text data is just a couple characters off of the extractor's regex pattern, Fuzzy RegEx can not only match the imperfect data but "swap" the wrong characters for the right ones, effectively cleansing your result.
However, that's just a single character off from being the right result. We could build an extractor to return currency values looking to make fuzzy swaps like this, both matching text that is slightly off and reformatting the result to match a valid currency format. If we used that extractor as the Labeled Value extractor's Value Extractor it would not only find the segment but also reformat the result, swapping the mis-OCR'd period for what it should be, a comma. And we've done just that.
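The kind of cleanup described here can be imitated in plain Python. This is a hand-rolled illustration, not Grooper's Fuzzy RegEx: it assumes the only OCR error is a period or comma being confused inside a currency amount, and the function name `clean_currency` is made up for the example.

```python
import re

# Accept either "." or "," as a separator so slightly wrong OCR text still matches.
LOOSE_CURRENCY = re.compile(r"\$?\d{1,3}(?:[.,]\d{3})*[.,]\d{2}\b")


def clean_currency(raw):
    """Find a currency-like value and rewrite it with standard separators."""
    match = LOOSE_CURRENCY.search(raw)
    if match is None:
        return None
    digits = re.sub(r"\D", "", match.group())
    whole, cents = digits[:-2], digits[-2:]
    # Rebuild the amount: commas between thousands groups, a period before the cents.
    return f"{int(whole):,}.{cents}"


fixed = clean_currency("Total 1.234.56")
```

The mis-read "1.234.56" comes back as "1,234.56", and a value that was already correct passes through unchanged.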
|
Additional Considerations When Using Labeled Value with Label Sets
Custom Labels to Exclude Results
|
Continuing the discussion above of leaving the Labeled Value extractor's Value Extractor unconfigured, let's examine the results of the "Purchase Order Number" Data Field.
This is obviously not what we want. We want the purchase order number listed below it. Ultimately, we will follow best practice and configure the Labeled Value extractor's Value Extractor property. However, before we do, this gives us an opportunity to demonstrate some additional functionality of the Labeling Behavior. This data "Order Date Customer No. Salesperson Order No. Ship Via" is itself comprised of labels pointing to various values on the document. Even though we haven't set up Data Fields in this Data Model to capture the values they point to, we know this is data we don't want. In general, you don't want to use Grooper to extract labels. You want to extract values. |
What's happening here is Grooper is returning all the text on this single line until a collected label in this Document Type's label set is located. In this case, the label Terms was collected for the "Payment Terms" Data Field. None of the text between the label PO Number and the label Terms has been collected in the label set. So, the Labeled Value extractor returns all the text between the "PO Number" Data Field's label (PO Number) and the next encountered label (Terms), resulting in "Order Date Customer No. Salesperson Order Number Ship Via".
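The stop-at-the-next-label behavior described above can be sketched in a few lines of Python. This is only an illustration, not Grooper's implementation: the function name `text_until_next_label`, the single-line string, and the order of the words on that line are all assumptions made for the example.

```python
def text_until_next_label(line, field_label, collected_labels):
    """Return the text after field_label, up to the next collected label on the line."""
    start = line.find(field_label)
    if start == -1:
        return ""  # the field's own label was not found on this line
    start += len(field_label)

    # Find the earliest collected label that appears after the field's label.
    end = len(line)
    for label in collected_labels:
        if label == field_label:
            continue
        pos = line.find(label, start)
        if pos != -1:
            end = min(end, pos)
    return line[start:end].strip()


# Hypothetical line of OCR text, assumed for illustration only.
line = "PO Number Order Date Customer No. Salesperson Order No. Ship Via Terms"
labels = {"PO Number", "Terms"}

result = text_until_next_label(line, "PO Number", labels)
print(result)  # Order Date Customer No. Salesperson Order No. Ship Via

# Adding one more (custom) label to the set shortens the returned text.
with_custom = text_until_next_label(line, "PO Number", labels | {"Ship Via"})
print(with_custom)  # Order Date Customer No. Salesperson Order No.
```

In this sketch, every label added to the set acts as another stopping point, which is why collecting labels for text you do not want changes what an unconfigured Value Extractor returns.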
|
⚠ |
This is very specific functionality to the Labeled Value extractor and its interaction with label sets. It will only behave this way if you:
|
|
This may be clearer if we add a Custom Label to the label set.
|
|||
|
|||
|
If we were to go one step further and add a Custom Label so that there is no text between the Data Field's label and another label in the label set, the Labeled Value extractor will return absolutely nothing at all.
|
|||
|
HOWEVER, this was not the right solution for this problem. This was only an educational exercise to make you aware of how labels in a label set interact with the Labeled Value extractor when its Value Extractor is left unconfigured.
However, there can be a reasonable solution if you cannot use a Value Extractor and must leave it unconfigured to capture a more generic text segment. This will require the use of "Data Element Overrides". For more information, visit the Data Element Override Utility section of this article. |
Maximum Noise
|
The Maximum Noise property of the Labeled Value extractor controls the maximum number of "noise characters" allowed in the "bounding-region" of a label-value pair. Now, what does that mean? Let's look at an example, using the "Remit Address" Data Field of our example Data Model.
What gives? It has to do with these "noise characters" mentioned above. |
|
|
Noise characters are any letters and digits falling within the bounding region defined by a label-value pair. For our example, the bounding region looks like this.
|
|
|
The noise characters are any letters or numbers within this rectangle other than the label or the value. The highlighted characters in the image would be the noise characters for our example. The Maximum Noise property allows you to configure how many of these non-label and non-value characters may exist in the bounding region. You don't typically expect to find a bunch of text between a label and a value. The Maximum Noise property acts as an additional filter to avoid returning results too far away from the label. Where Maximum Distance filters out results that are physically more than a set distance from the label, Maximum Noise filters out results that have lots of text between them and the label. With the default of 5, there can be a maximum of 5 letter or number characters between the label and value. However, in our case, we have more than 5. We have 15 ("FacturaTechnolo").
|
| FYI |
Noise characters are only letters and digits. Spaces, punctuation marks, and control characters are NOT considered noise characters, even if present in the bounding region. |
|
With this in mind, all we need to do to the "Remit Address" Data Field to successfully collect the result at time of extraction is increase the number of allowable noise characters.
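The noise count itself is simple arithmetic, and a short Python sketch makes the rule concrete. The function names and the sample string are assumptions for illustration. Only the rule (letters and digits count, everything else does not) and the default of 5 come from the text above.

```python
def count_noise(text_in_region):
    """Count letters and digits; spaces and punctuation are not noise."""
    return sum(1 for ch in text_in_region if ch.isalnum())


def passes_noise_filter(text_in_region, maximum_noise=5):
    """True when the region holds no more noise than Maximum Noise allows."""
    return count_noise(text_in_region) <= maximum_noise


# Non-label, non-value text assumed to sit inside the bounding region.
between = "Factura Technolo"

noise = count_noise(between)
print(noise)                                          # 15
print(passes_noise_filter(between))                   # False with the default of 5
print(passes_noise_filter(between, maximum_noise=15)) # True once the limit is raised
```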
|
For Data Field objects, you can collect both a "Header Label" and a "Footer Label". As we've seen, the Header Label is the text label for whatever field you're trying to extract. Essentially, the text label marks the beginning of the field's content.
The Footer Label is an optional label used to mark the end of the field's content. The Footer Label is useful when leaving the Labeled Value extractor's Value Extractor unconfigured. While it is still always considered best practice to configure the Labeled Value extractor's Value Extractor, there are certain types of data that are difficult to match with a regular expression, such as a person's name. In these types of situations, where you must run the Labeled Value extractor without a Value Extractor, a Footer Label can often help you throw out false positive or "junk" data.
|
The following example is manufactured to demonstrate this concept. Let's say we're using Label Sets to extract the "Settlement Agent". We would create a Data Field and collect a label for it. In the case of this document, we would get the result we wanted. The Labeled Value extractor's Label Extractor would match the collected label (stroked in blue). If left unconfigured, its Value Extractor would return the nearest segment to that label's location (according to its layout settings and operation discussed previously in this tutorial). This is exactly what we want, the highlighted name "Jourdain Meardon". |
|
|
However, what if that value is not present on another document? Such is the case in this image. In that case, the extractor is still going to look for the nearest segment. Depending on the layout settings, you might return "Seller" or you might return "File #". Both of those are segments. However, they are both the wrong result. The correct value in this case is nothing at all. |
|
|
With a Footer Label, we can change how the Labeled Value operates when its Value Extractor is left unconfigured. If we collect |
|
|
Here, we've tested extraction with only the Header Label assigned for the "Settlement Agent" Data Field.
This is junk data. There is no settlement agent listed on the document. No value should be returned. |
|
|
We will add a Footer label to prevent this junk data from returning.
If present on the document, we expect the settlement agent's name to be between the label |
|
|
When we test extraction for the "Settlement Agent" Data Field now, we get very different results.
|
|
⚠ |
This is very specific functionality to the Labeled Value extractor and its interaction with Label Sets. It will only behave this way if you:
|
Using Static Labels for Data Field Extraction
Collecting Static Labels
|
The Data Field elements have a unique label option, the Static label. This label option is useful for situations where the label itself is what you want to extract.
|
|
|
What we really want to do is collect a piece of information that is the same for every single document of one Document Type. We expect the vendor's name "Factura Technology Corp" to be present for every document assigned the "Factura" Document Type during classification. Furthermore, we always expect it to be "Factura Technology Corp" and not something else. Therefore, the vendor's name is "static" for the Document Type. It's present on every document of that Document Type and has the same value on every one of them. You know what else is static on structured and semi-structured forms? Labels! Just in this case the label "Factura Technology Corp" is itself the value we want to return. This is what a Static label is for.
|
Returning the Static Label
|
Now that the Static label is collected, how does Grooper know to return it during extraction when the Extract activity runs? The short answer is the Labeled Value extractor type will do this for us. With "Factura Technology Corp" collected as a Static label, and the "Vendor Name" Data Field configured to utilize the Labeled Value extractor, it will return the Static label itself as the result.
|
Label Sets and the Label Match Extractor Type
About Label Match
The Label Match extractor is extremely similar to the List Match extractor in that it matches one or more items in a defined list. However, it is designed specifically to work with the Labeling Behavior functionality. It will use the fuzzy extraction and vertical and constrained wrapping settings defined on the Content Model if a Labeling Behavior is enabled. This way, you can have a single, unified set of fuzzy match settings for multiple extractors. Rather than configuring these settings, including the confidence score threshold and fuzzy weighting, for multiple extractors, you can configure them just once when enabling the Labeling Behavior and all Label Match extractors will use them.
- For more information on fuzzy extraction, visit the Fuzzy RegEx article.
For the Label Match extractor to return a result, two conditions must be met.
- The document folder must be classified.
- In other words, it must have a Document Type assigned to it.
- That Document Type must have a Labeling Behavior enabled.
- Either on the Document Type or, more typically, its parent Content Model.
Label Match Example
|
Where are these Labeling Behavior settings again?
|
Use Label Sets for Tabular Extraction
Label Sets and the Tabular Layout Method
Label Sets and Tabular Layout
Many tables label the columns so the reader knows what the data in that column corresponds to. How do you know the unit price for an item on an invoice? Typically, that item is in a table and one of the columns of that table is labeled "Unit Price" or something similar. Once you read the labels for each column (also called "column headers"), you the reader know where the table begins (below the column headers) and can identify the data in each row (by understanding what the column headers refer to).
This is also the basic idea behind the Tabular Layout Extraction Method. It too utilizes column header labels to "read" tables on documents, or at least as step one in modeling the table's structure so that Grooper can extract data from each cell in the table.
Furthermore, using the Tabular Layout method, collected label sets using a Labeling Behavior can also be used to extract data from tables on documents. In this case, the labels collected for the Data Column children of a Data Table are utilized to help model the table's structure.
Once the column header locations are established, the next requirement is a way to understand how many rows are in the table. This is done by configuring at least one Data Column's Value Extractor property. Generally, there is at least one column in a table that is always present for every row in the table. If you can use an extractor to locate that data below its corresponding column header, that gives you a way of finding each row in the table.
Lastly, there are a few other considerations you might need to make. Is every row in the table a single line or are the rows "multiline"? Do you need to clean up the data the Tabular Layout initially extracts for a column by normalizing it with an extractor? Do you need to establish a table "footer" to limit the number of rows extracted?
This tutorial will cover the basic configuration of the Tabular Layout Extraction Method using collected Label Sets and address a few of these considerations.
|
The basic steps will be as follows:
In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four Data Columns for each row on the document. However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the "Additional Considerations" section below. |
Collect Labels
See the above how to (Collect Label Sets) for a full explanation of how to collect labels for Document Types in a Content Model. The following tutorial will presume you have general familiarity with collecting labels.
|
As far as strict requirements for collecting labels for tabular data extraction go, you must at minimum collect a label for each Data Column you wish to extract. For this "Stuff and Things" Document Type, one column header label has been collected for each of the four Data Column children of the "Line Items" Data Table.
|
|
|
You may optionally collect a label for the entire row of column header labels. This label is collected for the parent Data Table object's label.
It is generally considered best practice to capture a header row label for the Data Table. But if it's optional, why do it? What is the benefit of this label? |
The answer has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. The regex pattern Grooper and the character string "Gro0per" differ by just a single character. An OCR engine misreading an "o" character for a zero is not uncommon by any means, but a standard regex pattern of Grooper will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.
Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.
- FYI "similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. Grooper is more or less confident the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when Fuzzy Match Weightings come into play, but that's a whole different topic. And you may encounter Grooper users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the Fuzzy RegEx article if you would like to learn more).
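To make the idea of a similarity score concrete, here is a minimal Python sketch that scores the "Grooper" versus "Gro0per" example using plain edit distance. This is an assumption for illustration, not Grooper's actual scoring: Fuzzy RegEx may weight characters differently, and the 80% threshold below is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: single-character inserts, deletes, and swaps."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # swap (or keep) a character
        prev = curr
    return prev[-1]


def similarity(pattern, text):
    """Share of the pattern's characters that did not need editing."""
    if not pattern:
        return 1.0 if not text else 0.0
    return max(0.0, 1 - edit_distance(pattern, text) / len(pattern))


THRESHOLD = 0.80  # hypothetical similarity threshold

score = similarity("Grooper", "Gro0per")
print(round(score, 3))        # 0.857 (6 of 7 characters agree)
print("Grooper" == "Gro0per") # False: an exact comparison fails
print(score >= THRESHOLD)     # True: the fuzzy comparison succeeds
```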
|
So how does this apply to the Data Table's column header row label? The short answer is it provides a way to increase the accuracy of Data Column column header labels by "boosting" the similarity of the label to imperfect OCR results.
As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the Labeling Behavior's fuzzy match settings. |
|
|
First, notice what's happened when we lassoed the row of column header labels.
|
|
Not magic. Just math. The Data Table's column header row label is much longer than a single Data Column's column header label. There are simply more characters in "Qty. Qty. Item Number Description Unit Price Extended Price\r\nOrd. Shp." than in "Description" (70 vs 11). Where the "Description" Data Column's label is roughly 82% similar to the text data (9 out of 11 characters), the "Line Items" Data Table's label, comprised of the whole row of column labels, is roughly 96% similar to the text data (67 out of 70 characters). Utilizing a Data Table label allows you to hijack the whole row's similarity score when a single Data Column does not meet the similarity threshold. If the label can be matched as part of the larger whole, its confidence score is much higher than it would be by itself. The Data Table's larger label of the full row of column labels gives extra context to the "Description" Data Column's label, providing more information about what is and is not an appropriate match.

So why is it considered best practice to capture a label for the Data Table? OCR errors are unpredictable. The set of examples you worked with when architecting this solution may have been fairly clean with good OCR reads. That may not always be the case. Capturing a Data Table label for the column label row will act as a safety net to avoid unforeseen problems in the future. |
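The arithmetic behind the boost can be checked directly. The character counts below come from the text above. The 90% threshold is a hypothetical value chosen for illustration, not a documented Grooper default.

```python
column_label = "Description"
table_label = "Qty. Qty. Item Number Description Unit Price Extended Price\r\nOrd. Shp."

print(len(column_label), len(table_label))  # 11 70


def percent_similar(matching_chars, label_length):
    """Similarity as the percentage of label characters found in the text."""
    return 100 * matching_chars / label_length


column_score = percent_similar(9, len(column_label))  # 9 of 11 characters match
table_score = percent_similar(67, len(table_label))   # 67 of 70 characters match

THRESHOLD = 90  # hypothetical similarity threshold, in percent

print(round(column_score), column_score >= THRESHOLD)  # 82 False
print(round(table_score), table_score >= THRESHOLD)    # 96 True
```

A few misread characters cost the short label about 18 points but cost the long label only about 4, which is why the full header row can still clear a threshold that the single column label misses.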
Assign a Data Column's Value Extractor
This step is all about row detection.
So far, all we've done is establish column header positions on each document. So, Grooper knows where the table "starts". But, that's not where the data is. The table's data is in the rows.
As it stands, Grooper doesn't know anything about the rows in the tables. It doesn't know the size of each row. It doesn't know what kind of data is supposed to be in the rows. Maybe most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.
|
Indeed, if we were to test extraction with just labels collected, we would not get any result whatsoever.
|
|
|
This is why we need a Data Column's Value Extractor property configured, to give the Extract activity an awareness of the rows beneath the column labels. The key thing to keep in mind is this data must be present on every row. You'll want to pick a column whose data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row. In our case, we will choose the "Quantity" Data Column. We always expect there to be a quantity listed for the line item on the invoice, even if that quantity is just "1".
|
|
|
This is the pattern we will use for the "Quantity" Data Column's Value Extractor.
We get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.
|
For fairly simple table structures we now have the two things the Tabular Layout method needs to extract data.
So far, we have:
- Collected labels for the Data Column labels (and optionally the whole row of column labels for the Data Table)
- Configured the Value Extractor property of at least one Data Column.
Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.
Set Extract Method to Tabular Layout and Test
|
A Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following.
|
|
|
Now, let's test out what we have and see what we get!
For the Tabular Layout method, the Data Table is populated using primarily two pieces of information.
|
|
|
With these pieces of information, the Tabular Layout method can start to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. This allows Grooper to create data instances for each cell in the table.
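As a rough illustration of that reasoning, the sketch below builds a table from two inputs: column boundaries (which would come from the collected column header labels) and row positions (which would come from the row-detecting Value Extractor's hits). Everything here is assumed for the example, including the `Word` class, the coordinate units, and the `tolerance` parameter. It is not how Grooper represents a page internally.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # horizontal position of the word
    y: float  # vertical position of the word's line


def build_table(words, columns, row_ys, tolerance=0.5):
    """Assign each word to a cell by its row (y) and column (x range).

    columns maps a column name to its (left, right) boundaries.
    row_ys lists one vertical position per detected row.
    """
    table = []
    for row_y in row_ys:
        cells = {name: [] for name in columns}
        for word in words:
            if abs(word.y - row_y) > tolerance:
                continue  # word belongs to a different line
            for name, (left, right) in columns.items():
                if left <= word.x < right:
                    cells[name].append(word.text)
        table.append({name: " ".join(parts) for name, parts in cells.items()})
    return table


# Hypothetical page content and layout.
columns = {"Quantity": (0, 10), "Description": (10, 50)}
words = [
    Word("2", 5, 1), Word("Blue", 15, 1), Word("Widget", 25, 1),
    Word("1", 5, 2), Word("Gasket", 15, 2),
]
rows = build_table(words, columns, row_ys=[1, 2])
print(rows)
```

With no row positions at all, the function returns an empty table, which mirrors why labels alone produce no result until a Value Extractor is configured.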
|
Additional Tabular Layout Considerations
Multiline Rows
|
The table from the previous example had a single-line table structure. Each row occupied one line. Table extraction can get a little trickier when tables have a multiline row structure, especially if the table rows sometimes occupy a single line and sometimes multiple lines. For example, most of the rows on the line items table for the "Arve" Document Type are single-line rows. However, for some rows, the "Description" column spans multiple lines.
|
|
|
This is what the Multiline Rows property is for. Enabling this property will allow you to target table structures like this whose rows extend beyond just a single line on the page.
|
|
|
The Multiline Rows functionality will even detect multiline rows if the lines start on one page and continue to another.
|
|
Generally speaking, a "footer" is something on a document indicating the end of a portion of text. It could be the end of a page, end of a paragraph, end of a chapter, or even the end of something like a table. If you can find some sort of text label that signifies the end of the table, you can often avoid extraction errors where the Tabular Layout method overextends, capturing extra rows of "junk data".
We need a way to remove this row. We need some way of telling Grooper this row is invalid. We can do that very easily in this case with a Footer Label.
|
|
|
To do this, we will need to update the "Factura" Document Type's Label Set.
|
|
|
Data Column Value Extractors
|
As we've seen before, at least one Data Column must have its Value Extractor property configured. This is (part of) how the Tabular Layout method models the row structure of the table. Grooper needs some piece of information to tell where each row is (and how many rows the table has). In other words, it needs at least one Data Column's Value Extractor results to detect each row.
|
|
|
|
|
A Data Column's Value Extractor may also be used for data validation and/or cleansing purposes as well as finding the rows in a table.
|
|
|
If we assign this extractor to the "Line Total" Data Column's Value Extractor, upon extraction the Data Table will collect the extractor's result, ultimately cleansing the imperfect text data and returning the valid currency amount.
|
|
|
Now, with the "Line Total" Data Column utilizing an extractor, we will get a more accurate result.
|
Manipulating A Column's Extraction Logic
Continuing from the example in the previous tab, let's play "What if...?"
What if for rows with a zero dollar "Unit Price" amount and zero dollar "Line Total" amount, those cells were totally blank instead of "0.00"?
If that were the case, by configuring the "Line Total" Data Column's Value Extractor, we actually would end up throwing out the row entirely, preventing us from collecting data we want.
|
Remember, a Data Column's Value Extractor results are used to detect each row in the table. If there's no result, there's no row. Furthermore, by having two Data Columns' Value Extractors configured, both pieces of information are going to be required to find and return the row.
Uh oh! This means this is no longer a valid row according to the default use of the two Data Columns' Value Extractor results. |
|
Because the "Line Total" Data Column now has its Value Extractor configured, it must match a result in order to collect a table row. This is obviously not what we want. There's still data we want on that row, namely the "Quantity" and "Description" column data. |
|
|
Truly, we only want the "Quantity" Data Column to be used for row detection (we're presuming every row has a value for the "Quantity" column). We don't really want or need the "Line Total" Data Column's Value Extractor to detect the rows (the "Quantity" Data Column's Value Extractor is already doing that job). Instead, we want to perform secondary extraction for the "Line Total" column once a row is already located (and a row instance is established for each row's data). This is what the Tabular Layout method's Column Settings properties are for. This set of properties will allow you to choose if and how a Data Column's Value Extractor is used for row detection or secondary extraction after a row instance is formed.
|
|
|
From here, you select the Data Column you wish to configure. These settings will allow you to override the default settings for how a Data Column's Value Extractor is used for row detection and secondary extraction.
|
|
FYI |
The Secondary Extract property will control if secondary extraction is performed at all (Always or Never). The Secondary Extract Mode will control how secondary extraction is performed. The Data Column's Value Extractor can run against the cell in the row's text (CellExtract), it can run against the full row's text (RowExtract), or the extractor can be ignored entirely and the text falling within the geometric boundaries of the cell will be returned (Geometric). |
By manipulating the extraction logic of the Data Columns, we have more control over how those results are used for row detection or secondary extraction for data validation and cleansing. |
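The three Secondary Extract Mode values named above differ only in which text the Data Column's Value Extractor sees. The Python sketch below illustrates that difference under loud assumptions: the function `secondary_extract`, the currency pattern, and the sample strings are all hypothetical. Only the mode names come from this article.

```python
import re

# Hypothetical currency pattern standing in for a Data Column's Value Extractor.
CURRENCY = r"\d{1,3}(?:,\d{3})*\.\d{2}"


def secondary_extract(mode, pattern, cell_text, row_text):
    """Return a column value for an already-detected row."""
    if mode == "Geometric":
        return cell_text  # extractor ignored; raw text inside the cell is returned
    if mode == "CellExtract":
        target = cell_text  # extractor runs against the cell's text only
    elif mode == "RowExtract":
        target = row_text  # extractor runs against the full row's text
    else:
        raise ValueError(f"unknown mode: {mode}")
    match = re.search(pattern, target)
    return match.group(0) if match else None


row_text = "3 Widget 1,250.00 ea"  # full text of one detected row (assumed)
cell_text = "1,250.00 ea"          # text inside the column's boundaries (assumed)

print(secondary_extract("CellExtract", CURRENCY, cell_text, row_text))  # 1,250.00
print(secondary_extract("RowExtract", CURRENCY, cell_text, row_text))   # 1,250.00
print(secondary_extract("Geometric", CURRENCY, cell_text, row_text))    # 1,250.00 ea
print(secondary_extract("CellExtract", CURRENCY, "", "3 Widget"))       # None
```

The last call shows the point of separating row detection from secondary extraction in this sketch: a blank cell yields no value, but the row itself has already been established by another column.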
Label Sets and the Row Match Method
The Row Match table extraction method uses an extractor to pattern match each row of a table. For each result the extractor returns, the Data Table will collect one row. So, if the extractor returns forty results, you'll end up with a table with forty rows. Data Column results are then populated by filtering the data within each row to the proper column. Commonly, an array based collation method will be used to return the full row, then elements of that array will form the column results.
Label Sets can also be used in conjunction with the Row Match method. The Data Table's Header and/or Footer Labels can be leveraged to narrow where the Row Extractor executes on the document.
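The "one extractor result, one table row" idea can be sketched with an ordinary regular expression. The named groups below stand in for the way column values are filtered out of each matched row. The pattern, group names, and sample text are assumptions for illustration and are not taken from a real Row Extractor.

```python
import re

# Hypothetical row pattern: a two-digit line number, a description, an amount.
ROW_PATTERN = re.compile(
    r"^(?P<line>\d{2})[ \t]+(?P<description>.+?)[ \t]+(?P<amount>\d+\.\d{2})$",
    re.MULTILINE,
)

# Made-up document text; the "Total" line is not shaped like a row.
text = (
    "01 Appraisal Fee 405.00\n"
    "02 Credit Report Fee 30.00\n"
    "Total due at closing\n"
    "03 Flood Determination Fee 20.00"
)

# Each match becomes one row; each named group becomes one column value.
rows = [m.groupdict() for m in ROW_PATTERN.finditer(text)]

print(len(rows))  # 3
print(rows[0])    # {'line': '01', 'description': 'Appraisal Fee', 'amount': '405.00'}
```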
|
For example, take this Closing Disclosure form. This table structure could be targeted using the Tabular Layout method. However, it is more easily targeted and returned using the Row Match Method. The only potential problem is differentiating between the different sections of tables, such as "A. Origination Charges" and "B. Services Borrower Did Not Shop For" and "C. Services Borrower Did Shop For". The row structure of each of these highlighted tables is similar enough (or identical in the case of the "B" and "C" tables) that a single extractor could easily produce false positive matches. |
|
|
However, the labels on this document clearly define where each table begins and ends. How do you the reader know where the "B. Services Borrower Did Not Shop For" table starts? You read the label "B. Services Borrower Did Not Shop For". This is its header label. How do you the reader know where the "B. Services Borrower Did Not Shop For" table ends? Once you find the label "C. Services Borrower Did Shop For", you know you're looking at a different table. This is the "B" table's footer label. The Row Match method will utilize a Data Table's collected Header and Footer labels to define the table's boundaries. The Row Extractor will only return row instances following the Header and/or before the Footer. |
Example Row Extractor
|
|
|
|
|
Collecting Labels For Row Match
Using Label Sets, you can assign a Header and/or Footer label for a Data Table. The Row Match method will utilize these labels in the place of its Header Extractor and Footer Extractor, respectively.
|
|
|
More importantly, in our case, we need a Footer label. The Footer label will determine where the table "stops" on the document. For the Row Match method, no row instances will be collected after this label is encountered on the document. Any matches from our Row Extractor will be discarded after the collected label.
That's it! That's all you need to do to establish the table's header and footer. There is no need to collect labels for the Data Columns. Collecting labels for Data Columns is only necessary for the Tabular Layout method. I will repeat. The Row Match method will only utilize the Data Table's labels. If you collect labels for the Data Columns and you're using the Row Match method, they will do nothing as far as table extraction goes. |
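A minimal sketch of how Header and Footer labels narrow a row pattern's search area follows. It is an assumption-laden illustration: the function name, the fee lines, and the amounts are invented, and returning no rows when the header is missing is a choice made for this example, not documented Grooper behavior.

```python
import re

ROW = re.compile(r"^\d{2}[ \t]+.+?[ \t]+\d+\.\d{2}$", re.MULTILINE)


def rows_between_labels(text, row_pattern, header=None, footer=None):
    """Return row matches found after the header and before the footer."""
    start = 0
    if header is not None:
        pos = text.find(header)
        if pos == -1:
            return []  # assumption: no header on the page means no table
        start = pos + len(header)
    end = len(text)
    if footer is not None:
        pos = text.find(footer, start)
        if pos != -1:
            end = pos  # matches after the footer are discarded
    return [m.group(0) for m in row_pattern.finditer(text[start:end])]


# Made-up text loosely shaped like the sections described above.
text = (
    "A. Origination Charges\n"
    "01 Application Fee 300.00\n"
    "B. Services Borrower Did Not Shop For\n"
    "01 Appraisal Fee 405.00\n"
    "02 Credit Report Fee 30.00\n"
    "C. Services Borrower Did Shop For\n"
    "01 Pest Inspection Fee 120.00\n"
)

unbounded = rows_between_labels(text, ROW)
bounded = rows_between_labels(
    text, ROW,
    header="B. Services Borrower Did Not Shop For",
    footer="C. Services Borrower Did Shop For",
)
print(len(unbounded))  # 4: rows from all three sections
print(bounded)         # ['01 Appraisal Fee 405.00', '02 Credit Report Fee 30.00']
```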
Enable Label Sets
|
The only thing left to do is "tell" the Row Match method you want to use the Header and Footer labels. This is done by enabling the Use Labelset property.
|
|
Test Results
With the labels collected and the Use Labelset property enabled, our Data Table will properly collect the rows we want from this table.
|
The Fluid Layout Method
The Fluid Layout table extraction method is designed to switch between the Tabular Layout method and the Row Match method, depending on how a Data Table's labels are configured. So, if you have a varied set of documents where Tabular Layout works well for some Document Types and Row Match works well for other Document Types, you may be able to use Fluid Layout for all of them, avoiding the need for Data Element Overrides.
| ⚠ | Label Sets must be collected to use the Fluid Layout method. Each Document Type will use either Tabular Layout or a Row Extractor to collect table data depending on how the labels for a Data Table are collected. Therefore, you cannot utilize the Fluid Layout method without a Labeling Behavior enabled.
The Fluid Layout table extraction method is not only "Label Set aware", it is Label Set dependent. |
|
For example, take these two versions of code descriptions from an EOB form. Version 1 is clearly a table. It uses the labels "CODE" and "DESCRIPTION" to delineate between each column. The Tabular Layout table extraction method would handily extract this information, returning everything in the "CODE" column to one Data Column and everything in the "DESCRIPTION" column to another. |
|
|
Version 2 is not exactly a table, but a Data Table could still use the Row Match method to form a table from this list of information. Each bulleted item in the list could be returned as a table row. The code could be filtered into one Data Column and the description could be filtered into another. You could not use the Tabular Layout method for this "table". There are no column labels present.
|
So, we have a situation where the Tabular Layout or the Row Match method is preferable, depending on the document's layout. Next, we will review how to configure the Fluid Layout table extraction method to target both table structures.
Collect Labels
The first thing you will want to do is collect labels for your Data Table for each document type. How the labels are collected will determine which table extraction method the Fluid Layout method executes.
- To execute the Tabular Layout method, the Data Table's Data Column Header labels must be collected.
- Optionally, you may choose to collect a Header and/or Footer label for the Data Table.
- To execute the Row Match method (also referred to as the Flow Layout), you must collect the Data Table's Header label. You may NOT collect labels for the Data Table's Data Columns.
- This will be how Grooper checks to see which extraction method is used for each Document Type. If Data Column labels are present, the Tabular Layout configuration is used. If no Data Column labels are present, but the Data Table's Header label is present, the Flow Layout (i.e. Row Match) configuration is used.
- Optionally, you may choose to collect a Footer label for the Data Table.
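The selection rule in the list above amounts to a two-step check. The Python sketch below restates it. The function name, the dictionary shape, and the "Remark Codes" header text are hypothetical, and returning `None` when no labels exist is an assumption made for this example.

```python
from typing import Dict, Optional


def choose_layout(
    data_column_labels: Dict[str, Optional[str]],
    table_header_label: Optional[str],
) -> Optional[str]:
    """Pick the Fluid Layout configuration from the labels that were collected."""
    if any(data_column_labels.values()):
        return "Tabular Layout"  # any Data Column label selects Tabular Layout
    if table_header_label:
        return "Flow Layout"     # only the Data Table's Header label (Row Match)
    return None                  # assumption: nothing collected, nothing to run


# "V1": column header labels collected.
v1 = choose_layout({"Code": "CODE", "Description": "DESCRIPTION"}, None)

# "V2": no column labels, only a (hypothetical) Data Table Header label.
v2 = choose_layout({"Code": None, "Description": None}, "Remark Codes")

print(v1)  # Tabular Layout
print(v2)  # Flow Layout
```

Note that in this sketch the column-label check comes first, so a Document Type with both column labels and a table Header label still resolves to Tabular Layout, matching the optional Header label allowed in the first bullet.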
|
For Tabular Layout Document Types
The "V1" Document Type will utilize the Fluid Layout method's Tabular Layout configuration. To execute the Tabular Layout configuration, much like executing the Tabular Layout table extraction method in general, Data Column labels must be collected.
|
|||
|
For Flow Layout Document Types
The "V2" Document Type will utilize the Fluid Layout method's Flow Layout configuration. This will utilize the Row Match method to return table data. To execute the Flow Layout configuration, ONLY the Data Table's label must be collected.
|
Configure Fluid Layout
|
Now that the labels are collected for our Document Types we can configure the Fluid Layout extraction method for our Data Table.
|
|
|
Expanding the Fluid Layout sub-properties, you can see there are two Layout configurations.
|
|
|
By expanding the Tabular Layout and Flow Layout properties, you can see their property panels are identical to the Tabular Layout and Row Match table extraction methods respectively.
All that's left is to configure extraction logic for each of the Layouts. |
Configure Flow Layout
|
The Flow Layout configuration extracts table data using the Row Match method. What do you need in order for Row Match to collect table data? A Row Extractor.
For our purposes, that's all we need to do. For the "V2 - Row Match" Document Type, this extractor will properly return each row and collect each column's data. We have no need to configure any of the other Row Match properties. |
Configure Tabular Layout
|
The Tabular Layout configuration extracts table data using the Tabular Layout method. What do you need in order for Tabular Layout to collect table data? At least one Data Column's Value Extractor must be configured in order to detect each row in the table.
This is a fairly simple table with only two columns. Just configuring one Data Column's Value Extractor will be sufficient for our needs. |
|
|
Test Extraction
Now that extraction is configured for both the Tabular Layout and Flow Layout configurations, Grooper will switch between the Tabular Layout and Row Match table extraction methods, depending on the Document Type.
|
For the "V1 - Tabular Layout" Document Type, Data Column labels were collected. Therefore Grooper extracts the table using the Tabular Layout configuration. |
|
|
For the "V2 - Row Match" Document Type, only the Data Table's Header label was collected, and no Data Column labels were collected. Therefore, Grooper extracts the table using the Flow Layout configuration (using the Row Match method). |
Use Label Sets for Sectional Extraction
There are two Label Set aware Extract Methods for Data Sections.
- Transaction Detection
- Nested Table
The Transaction Detection method will be most applicable to the majority of use cases wanting to use labels to produce section instances. If you simply want to produce a section starting at a header label and ending at a footer label, the Transaction Detection method is what you want. However, this configuration of Transaction Detection is quite different from how it normally produces sections. We will go over how Transaction Detection establishes section instances both with and without Label Sets.
The Nested Table method is a much more niche section extraction method. It produces section instances using repeating tables, nested within each section. This can be a highly effective way to target sections for certain use cases, such as medical EOB (explanation of benefits) forms.
Label Sets and the Transaction Detection Method
About Transaction Detection
|
The Transaction Detection section extraction method is useful for semi-structured documents containing multiple sections that are themselves very structured, repeating the same (or at least very similar) field or table data. For example, take this monthly tax reporting form from the Oklahoma Tax Commission. There are five sections of information on this document, listed as "A" "B" "C" "D" and "E". Each of these sections collects the exact same set of data:
The Transaction Detection method looks for periodic similarity (also referred to as "periodicity") to sub-divide a document into multiple section instances. |
|||
|
For structured information like this, one way you can define where each section starts and stops is just by the patterns in the fields themselves. These values are periodic. They appear at set intervals, marking the boundaries of each section. For example,
The Transaction Detection method detects the periodic patterns in these values to divide the document into sections, forming one section instance from each periodic pattern of data detected. Part of how the Transaction Detection method detects these patterns is by using extractors configured in the Data Section's child Data Field objects. These are called Binding Fields. Grooper uses the results matched by these Data Fields to detect each periodic section. For example, you might have a "Production Unit Number" Data Field for these sections that returns five results, one for each section. Once these five results are established, Grooper will look for other patterns around these results to establish the boundaries of each of the five sections. |
|||
|
The Transaction Detection method also analyzes a document's text line-by-line, looking for repeated lines that are highly similar to each other. For example, each of the yellow highlighted lines is extremely similar. They are essentially identical except for the starting character on each line (either "A" "B" "C" "D" or "E"). This repeated pattern is a good indication that we have a set of repeated (or "periodic") sections of information. Furthermore, the next lines, highlighted in blue, are also similar as long as you normalize the data a bit. If you replace each specific number with just a "#" symbol, they too are nearly identical. The Transaction Detection method continues line-by-line, comparing the text on each line to subsequent lines and looking for repeating patterns. Such is the case for the rest of the green highlighted lines. Even accounting for OCR errors, each line is similar enough to detect a pattern. We have five sets of very similar lines of text, so five section instances are ultimately returned for the Data Section. Eventually, Grooper will detect a line that does not fit the pattern. The red highlighted line is totally dissimilar from the set of similar lines detected previously. This is where Grooper "knows" to stop. Because it does not fit the periodic pattern, this line marks a stopping point. Its text is left out of the last section instance and, with no further lines matching the detected periodic pattern, no further section instances are returned.
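To make the idea of periodic similarity more concrete, here is a minimal Python sketch of the line-by-line comparison described above. It is not how Grooper implements Transaction Detection. The function names, the 0.8 similarity threshold, the use of `difflib.SequenceMatcher`, and the sample lines are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import List


def normalize(line: str) -> str:
    """Replace each run of digits with '#' so lines that differ only in numbers look alike."""
    return re.sub(r"\d+", "#", line.strip())


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy comparison of two normalized lines (the threshold is an assumed value)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def split_periodic(lines: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Group lines into repeated sections, stopping at the first chunk that breaks the pattern."""
    if not lines:
        return []
    # Every line that resembles the first line is treated as the start of a section.
    starts = [i for i, ln in enumerate(lines) if similar(lines[0], ln, threshold)]
    period = starts[1] - starts[0] if len(starts) > 1 else len(lines)
    template = lines[:period]
    sections = []
    for s in starts:
        chunk = lines[s:s + period]
        fits = len(chunk) == period and all(
            similar(t, c, threshold) for t, c in zip(template, chunk)
        )
        if not fits:
            break  # A chunk that does not fit the pattern marks the stopping point.
        sections.append(chunk)
    return sections


# Hypothetical sample text, loosely modeled on the repeated sections described above.
lines = [
    "A. Production Unit Number 12345",
    "Gross Volume 100 Gross Value 2500",
    "B. Production Unit Number 67890",
    "Gross Volume 220 Gross Value 4800",
    "C. Production Unit Number 24680",
    "Gross Volume 75 Gross Value 1300",
    "I declare under penalty of perjury that this report is true",
]
sections = split_periodic(lines)
print(len(sections))
```

In this sketch the final declaration line resembles none of the repeated lines, so it is left out of every section, mirroring the red highlighted line in the example above.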
|
| FYI |
What does this have to do with Labeling Behavior and Label Sets? We're getting there. Ultimately, Transaction Detection is "Label Set aware" and can take advantage of collected Header and Footer labels for a Data Section object. However, collecting labels for the Data Section will quite dramatically change how Transaction Detection works. It is best to understand how this sectioning method works without Label Sets before we delve into how it works with them. |
Configuring Transaction Detection with Binding Fields
|
Without Label Sets, the Transaction Detection sectioning method must be assigned at least one Binding Field in order to detect the periodic similarity among lines of text in a document, ultimately forming the Data Section's section instances.
|
|||
|
Next, we will configure the "Production Info" Data Section to create section instances using the Transaction Detection method.
|
|||
|
The Transaction Detection method will then go through the line-by-line comparison process around the Binding Fields to detect periodic similarities to produce section instances.
|
|||
|
Configuring Transaction Detection with Label Sets
Now that we understand the basics of the Transaction Detection method, we can look at how this sectioning method interacts with the Labeling Behavior. Its behavior is wildly different if a Header label is collected for the Data Section. Assuming you can collect a Header label for the Data Section, it is so different that a Binding Field is not even necessary to produce the section instances.
Establishing the section instances is almost as simple as...
- Start the section instance at the Header label.
- Stop the section instance at the next Header label (or Footer label).
- Repeat for every Header label found on the document.
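The three steps above can be sketched as a simple loop over a document's lines. This is a simplified illustration under stated assumptions, not Grooper's implementation: labels are matched with a plain substring test, the Footer line is excluded from the section, and the function name and sample lines are hypothetical.

```python
from typing import List, Optional


def sections_from_labels(
    lines: List[str], header: str, footer: Optional[str] = None
) -> List[List[str]]:
    """Start a section at each Header label; stop at the next Header or Footer label."""
    sections: List[List[str]] = []
    current: Optional[List[str]] = None
    for line in lines:
        if header in line:
            # A new Header label closes any open section and starts the next one.
            if current is not None:
                sections.append(current)
            current = [line]
        elif footer is not None and footer in line:
            # Assumption: the Footer line itself is not part of the section.
            if current is not None:
                sections.append(current)
            current = None
        elif current is not None:
            current.append(line)
    if current is not None:
        sections.append(current)
    return sections


# Hypothetical document text.
doc = [
    "Monthly Report",
    "Production Unit Number 1",
    "Volume 10",
    "Production Unit Number 2",
    "Volume 20",
    "Signature",
    "Date",
]
print(sections_from_labels(doc, "Production Unit Number", "Signature"))
```

Note that no Binding Field appears anywhere in this sketch. The Header label alone is enough to find where each section begins.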
|
For example, we have collected a Header label for the "Production Info" Data Section here.
|
|
|
Next, we will configure the "Production Info" Data Section to create section instances using the Transaction Detection method.
|
Label Sets and the Nested Table Method

The Nested Table Data Section Extract Method was specifically designed for a particular combination of sectional and tabular data found on medical EOB (Explanation of Benefits) forms, although it may have applications in other areas. These documents will often be broken up into sections of claims with information like a claim number and a patient's name, followed by a table of information pertaining to services rendered and ending with some kind of footer, usually a total line adding up amounts in the table.
One way you can often identify where these claim sections start and stop is by the tables themselves. Essentially you'll have one table per claim. Or in Grooper's case, one Data Table instance per Data Section instance.
The Nested Table sectioning method takes advantage of these kinds of structures, utilizing a Document Type's Label Set to do so.
|
The Nested Table method has two hard requirements:
|
Set Data Section's Extract Method to Nested Table
|
The Nested Table method is a little different in that it is a sectional extraction method but also involves tabular data. Ultimately, both a Data Section object and a Data Table object are required for it to work. However, it is primarily a method of breaking up a document into multiple sections for data extraction purposes. As such, it is a Data Section extraction method.
|
Configure the Data Table
The Data Table should be configured to collect all table rows for the full document. When configuring the Data Table and testing its results, just ensure the table accurately extracts the full document as a single table. The Data Section (using the Nested Table method) will take care of breaking out the rows into the appropriate sections.
| ⚠ | It is considered best practice for the child Data Table to use the Tabular Layout method.
The Nested Table method was designed specifically with a Data Table using the Tabular Layout table extraction method in mind. While it is technically possible to use other table extraction methods, you will achieve the best results when the Data Table uses the Tabular Layout method. |
|
|||
|
In order for the Nested Table method to properly section out the document, you must assign a Footer label in the Document Type's Label Set for the Data Section. This will give Grooper an awareness of where the section should stop (ultimately allowing another section to start). In our case, we can use the text label "Totals". At the end of every table/section there is this "Totals" line totaling up various columns in the table. Since this label is present at the end of every section, we can collect it as the Data Section's Footer label, which the Nested Table method will then use to establish where each section instance ends.
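As a rough illustration of how a Footer label can mark the end of each section, the following Python sketch closes a section every time the footer text appears. It is a simplification and not Grooper's Nested Table logic. The function name, the substring match, and the sample claim lines are assumptions. Here the footer line is kept as the last line of its section, and any lines after the final footer are dropped.

```python
from typing import List


def split_at_footer(lines: List[str], footer: str = "Totals") -> List[List[str]]:
    """Close one section each time the footer label appears (the footer line is kept)."""
    sections: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        current.append(line)
        if footer in line:
            sections.append(current)
            current = []
    # Assumption: trailing lines with no footer after them do not form a section.
    return sections


# Hypothetical EOB-style text with three claims.
eob = [
    "Claim 1001 Patient A",
    "01/02/2021 Office Visit 50.00",
    "Totals 50.00",
    "Claim 1002 Patient B",
    "01/05/2021 Lab Work 75.00",
    "01/06/2021 X-Ray 120.00",
    "Totals 195.00",
    "Claim 1003 Patient C",
    "01/09/2021 Office Visit 50.00",
    "Totals 50.00",
    "Page 1 of 1",
]
claims = split_at_footer(eob)
print(len(claims))
```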
|
Run Global Header Detection
|
If we were to test extraction at this point, we would see mixed results. The Data Section will correctly produce the three section instances for this document. However, the tabular data will not be collected.
|
|
|
As a side note, the "Inspector" tab can be very helpful when troubleshooting extraction in general, but particularly Data Section and Data Table extraction.
If you were to select each section instance, you could verify at this point that all three sections were established successfully and the text data for each table is present. It just wasn't extracted. Why not? |
|
|
This has to do with where the section instances are and where the Data Table and Data Column labels are.
This presents a challenge. The Tabular Layout table extraction method relies on these labels in order to extract the tabular data. As is, Grooper can't "see" outside of the section instances. If only Grooper could look up to the table and column labels, the table would extract with ease. |
|
|
Luckily, there is a way for the Tabular Layout method to do just that, using the Run Global property.
This will allow Grooper to detect Data Table and Data Column labels outside of the section instances. Perfect for what we're trying to do here. What's going to happen when we test extraction now? Find out in the next tab! |
Test For Success
|
With the child Data Table now using global header detection (by setting the Run Global property to True), it can look outside the section instances for the column header labels on the full document. Let's see how our sections extract now and if we get any table data populated.
Success! The Run Global property is extremely beneficial when trying to extract table data from multiple sections. Without it, Tabular Layout would not have any way of referring to the column header labels collected in the Label Set. With this property enabled, Tabular Layout can do something very atypical for sectional data extraction. It can look beyond a section instance's text data and refer to the full document (in this case, to locate the Data Table and Data Column labels in the Label Sets).
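The difference between searching only inside a section instance and searching the full document can be sketched as follows. This is a toy model using fixed-width text, not Grooper's Tabular Layout implementation. The helper names `find_columns` and `extract_rows`, the column labels, and the sample text are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def find_columns(lines: List[str], labels: List[str]) -> Optional[Dict[str, int]]:
    """Return each label's character offset from the first line containing all labels."""
    for line in lines:
        if all(label in line for label in labels):
            return {label: line.index(label) for label in labels}
    return None


def extract_rows(section_lines: List[str], columns: Dict[str, int]) -> List[Dict[str, str]]:
    """Slice each line of a section at the column offsets found elsewhere."""
    ordered = sorted(columns.items(), key=lambda kv: kv[1])
    rows = []
    for line in section_lines:
        row = {}
        for i, (label, start) in enumerate(ordered):
            end = ordered[i + 1][1] if i + 1 < len(ordered) else len(line)
            row[label] = line[start:end].strip()
        rows.append(row)
    return rows


labels = ["Service Date", "Charge"]
# Hypothetical document: the column headers appear once, above the claim sections.
document = [
    "Service Date".ljust(15) + "Charge",
    "Claim 1001",
    "01/02/2021".ljust(15) + "50.00",
    "Claim 1002",
    "01/05/2021".ljust(15) + "75.00",
]
section = document[4:5]  # the table row belonging to the second claim

local = find_columns(section, labels)      # headers are not inside the section
global_ = find_columns(document, labels)   # headers found on the full document
print(local, extract_rows(section, global_))
```

Searching only the section finds no column headers, which is the failure seen before Run Global was enabled. Searching the full document locates them, and the section's rows can then be split into columns.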
|
Bonus Info: Hierarchical Tables and Peer Parent Labels
Additional Information
Label Layout Options
As we've been collecting labels, you may have noticed the Layout property change from Simple to Tabbed or Wrapped. The Layout property determines how the label's text is presented on the document. The Layout can be one of the following six options:
- Simple
- Tabbed
- Substring
- Boxed
- Wrapped
- Ruled
When collecting labels in the Labels tab, Grooper will automatically detect the appropriate label layout. However, there may be some circumstances where you need to manually select the label's layout. Next, we will describe each of these Layout options.
Simple
The Simple Layout is by far the most common. Most fields on a document will utilize this layout. These labels consist of words that do not cross "segment boundaries", meaning the words themselves are not separated by large amounts of whitespace like tabs or by terminal characters like a colon (as a colon character often marks the end of a label).
|
Tabbed
The Tabbed Layout is used for situations where you do want to cross segment boundaries. Think about capturing a table's row of header labels. Often each column's label will be separated by large amounts of whitespace. The Simple Layout would not permit you to capture the table's header but Tabbed will.
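The notion of a "segment boundary" can be illustrated with a small Python sketch. This is a loose model and not Grooper's matching logic. It assumes a boundary is either a colon or a run of three or more whitespace characters, models a Simple label as one that matches a whole segment, and models a Tabbed label as a sequence of parts matching consecutive segments. All function names and sample text are hypothetical.

```python
import re
from typing import List


def segments(line: str) -> List[str]:
    """Split a line at assumed segment boundaries: a colon or 3+ whitespace characters."""
    parts = re.split(r":|\s{3,}", line)
    return [p.strip() for p in parts if p.strip()]


def matches_simple(label: str, line: str) -> bool:
    """A Simple label must fit inside a single segment (modeled here as a whole segment)."""
    return label in segments(line)


def matches_tabbed(label_parts: List[str], line: str) -> bool:
    """A Tabbed label may span several consecutive segments."""
    segs = segments(line)
    n = len(label_parts)
    return any(segs[i:i + n] == label_parts for i in range(len(segs) - n + 1))


# Hypothetical table header row, with columns separated by wide gaps.
columns = ["Item No.", "Description", "Qty", "Unit Price"]
header_row = "     ".join(columns)

print(segments("Invoice Number: 12345"))
print(matches_simple("Description", header_row))
print(matches_simple("Item No. Description", header_row))
print(matches_tabbed(columns, header_row))
```

A single column label fits within one segment, so the Simple model matches it. The whole header row crosses several boundaries, so only the Tabbed model can return it.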
|
Substring
The Substring Layout is intended for circumstances where a label is bookended between other portions of text. In other words, it is a "substring" of a larger string of text.
|
|
|
This is a situation where we would want to manually assign the label's Layout, if we want to collect substrings as labels.
|
Wrapped
The Wrapped Layout will return labels that wrap across full lines of text on a document. So, if a label starts on one line, then continues on one or more lines after, this layout will successfully return it. The Wrapped Layout was also useful when we were collecting table labels for the entire header row. For those tables that had column headers on multiple lines, this layout was most appropriate to return the whole row of column headers.
- However, the Simple Layout will not work to capture this portion of text as a label.
- Normally, Grooper would capture this text as a Wrapped Label, but we manually assigned it the Simple Layout.
|
|
|
|
|
|
|
Ruled
Lines are used on documents to divide fields, sections, table columns or otherwise distinguish between one piece of information and another. Because of this, it is atypical to find a stacked label with a line between the first and second label. The Simple Layout respects this by preventing labels from returning if a horizontal line falls between any portion of the stacked label.
However, there may be rare circumstances where a horizontal line does fall between portions of the stacked label. In that case, you will want to use the Ruled Layout.
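The difference between the Simple and Ruled Layouts for stacked labels can be sketched as a check on vertical positions. This is only an illustration under assumed conventions, not Grooper's logic: positions are y-coordinates that increase down the page, line locations are given as a list of y-values, and the function name is hypothetical.

```python
from typing import List


def stacked_label_allowed(
    top_y: float, bottom_y: float, line_ys: List[float], layout: str = "Simple"
) -> bool:
    """Decide whether a two-part stacked label may be returned.

    top_y and bottom_y are the vertical positions of the label's first and second
    parts; line_ys holds the positions of detected horizontal lines.
    """
    line_between = any(top_y < y < bottom_y for y in line_ys)
    if layout == "Ruled":
        # Ruled tolerates a horizontal line between the stacked parts.
        return True
    # Simple rejects the label when a horizontal line separates its parts.
    return not line_between


print(stacked_label_allowed(1.0, 1.4, [1.2], layout="Simple"))
print(stacked_label_allowed(1.0, 1.4, [1.2], layout="Ruled"))
```

The check depends entirely on knowing where the lines are, which is why line location information must already be present before these layouts can behave differently.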
| ⚠ | Line location information must be present in the Layout Data in order for Grooper to determine if a line is present. A Line Detection or Line Removal command must have been previously executed by an IP Profile during Image Processing or Recognize to obtain this information. |
|
|
|
|
Boxed
The Boxed Layout is intended to capture labels that wrap inside a box, enclosed in lines on all four sides. You can use this Layout to distinguish between labels that fall inside a box and those that do not when the Vertical Wrap property is disabled.
| ⚠ | Line location information must be present in the Layout Data in order for Grooper to determine if a line is present. A Line Detection or Line Removal command must have been previously executed by an IP Profile during Image Processing or Recognize to obtain this information. |
|
|
|
You can differentiate between a label in a box and one outside a box by disabling the Vertical Wrap property.
|
|
|
|
|
|
Data Element Override Utility
Earlier in this article, we talked about using the Labeled Value Extractor Type without configuring its Value Extractor. Again, it is considered best practice to configure its Value Extractor. However, sometimes data is difficult to pattern match. For example, crafting an extractor to return people or company names can be difficult. These cases are the reason Label Sets give you the option to leave a Labeled Value extractor's Value Extractor unconfigured.
To make the best use of this functionality, Data Element Overrides are typically necessary. Indeed, because the Label Set approach is more templated in nature, Data Element Overrides can be a useful tool to fine tune extraction for one specific Document Type. In this section, we will use the "Purchase Order Number" Data Field of our "Labeling Behavior - Invoices - Model" Content Model to demonstrate this.
Revisiting the Problem
|
The problem arose due to how the Labeled Value extractor behaves when its Value Extractor is left unconfigured. For some of our invoices, this didn't really present a problem at all.
|
|
|
For certain document layouts, this approach works just fine.
|
|
|
However, this will not be the case for all document layouts, notably those whose labels are stacked vertically on top of their corresponding value.
|
|
|
However, we can easily get this extractor to return the actual purchase order number. All we have to do is tell it not to look to the right of the label.
|
|
|
But, what about our documents that did have the purchase order number laid out to the right of the label?
Data Element configurations are globally applied to all Document Types which inherit them. In our case, all our Document Types inherit the Content Model's Data Model (and its child Data Elements, such as our "Purchase Order Number" Data Field). Therefore, the changes we make to the "Purchase Order Number" Data Field's extractor will affect all documents of all Document Types. It's simply going to execute as we configure it, regardless of which specific Document Type is extracted. We're really in a situation where we want one Document Type to use one configuration and another Document Type to use a slightly different configuration. This is exactly what "Data Element Overrides" are for. |
Data Element Override Basics
|
Before we get into setting up "Data Element Overrides", we will rewind a bit and set our Labeled Value extractor's Maximum Distance properties back to the default settings.
|
|
|
What we want to do here is change how the properties of the "Purchase Order Number" Data Field are configured for the "Factura" Document Type, and ONLY for the "Factura" Document Type. "Data Element Overrides" allow us to do this by overriding a Data Element's property settings for a specific Document Type (in our case, the "Purchase Order Number" Data Field for the "Factura" Document Type). "Data Element Overrides" are configured using the Document Type object to which they will be applied. We will thus configure an override for the "Factura" Document Type.
|
|
|
|
|
Using the "Property Overrides" UI, any property configuration we edit will only apply to the selected Document Type (in our case, the "Factura" Document Type).
Now the "Purchase Order Number" Data Field will extract using these settings, only for the "Factura" Document Type.
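Conceptually, an override behaves like a per-Document Type layer placed on top of the shared configuration. The Python sketch below models that idea with dictionaries. It does not reflect how Grooper stores or resolves overrides internally. The property keys `max_distance_right` and `max_distance_below`, their values, and the function name are hypothetical stand-ins for the Labeled Value extractor's Maximum Distance settings.

```python
from typing import Any, Dict


def effective_properties(
    base: Dict[str, Any],
    overrides: Dict[str, Dict[str, Any]],
    document_type: str,
) -> Dict[str, Any]:
    """Merge the shared Data Field settings with any override for one Document Type."""
    return {**base, **overrides.get(document_type, {})}


# Hypothetical shared settings for the "Purchase Order Number" Data Field.
base = {"max_distance_right": 2.0, "max_distance_below": 2.0}

# Hypothetical override: for "Factura" only, do not look to the right of the label.
overrides = {"Factura": {"max_distance_right": 0.0}}

print(effective_properties(base, overrides, "Factura"))
print(effective_properties(base, overrides, "Envoy"))
```

Only the overridden property changes for "Factura". Every other Document Type, and every other property, keeps the shared configuration.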
|
|
|
|
Data Element Overrides can be an effective way of fine tuning extraction logic specific to an individual Document Type. Because the Label Set approach is more templated in nature, each Document Type corresponds to one specific format, meaning the document's layout will be consistent for each folder classified as that Document Type. Many users will take advantage of this and leverage Data Element Overrides for various fields on various Document Types, especially when utilizing Label Sets. There is a shortcut to configuring Data Element Overrides using the "Labels" collection UI, which we will demonstrate in the next tab. |
Overrides & the Labels UI
In the previous tab, we taught you the normal way to configure Data Element Overrides for a Document Type. You can configure overrides in this manner whether or not you're using a Labeling Behavior in your Content Model. If you are using a Labeling Behavior, there is a shortcut to edit overrides for a Data Element. You can do it directly from the "Labels" tab, using the same UI you use to collect labels.
|
|
|
So, we need an override for the "Purchase Order Number" Data Field for the "Envoy" Document Type, which we can do without leaving the Labels UI.
|
|
Furthermore, you can test the override directly from the Labels UI as well. You can actually test extraction for the whole Data Model!
|
|
|
Version Differences
2021
The Labeling Behavior is brand new functionality in Grooper version 2021. Prior to this version, its functionality could in some cases be approximated by other objects and their properties. For example, a Data Type using the Key-Value Pair collation is at least in some ways similar to how the Labeled Value Extractor Type works. However, the creation of Label Sets using Document Types, and their implementation as described above, was not available prior to version 2021.