2023:Transaction Detection (Section Extract Method): Difference between revisions

From Grooper Wiki
{|class="wip-box"
{{AutoVersion}}
|
'''WIP'''
|
This article is a work-in-progress or created as a placeholder for testing purposes.  This article is subject to change and/or expansion.  It may be incomplete, inaccurate, or stop abruptly.
 
This tag will be removed upon draft completion.
|}


<blockquote style="font-size:14pt">
<blockquote>{{#lst:Glossary|Transaction Detection}}</blockquote>
'''''Transaction Detection''''' is a '''[[Data Section]]''' '''''Extract Method'''''.  This extraction method produces section instances by detecting repeating patterns of text around the '''Data Section's''' child '''Data Fields'''.
</blockquote>


This '''Data Section''' '''''Extract Method''''' is unique in that it leverages the extracted values of the '''Data Section's''' child '''Data Fields''' to produce the section instances.  Normally, the section instance is established first, then the '''Data Fields'''' extractors return results from the text data in the instance.  In the case of '''''Transaction Detection''''', one or more '''Data Fields''' are specified as '''''"Binding Fields"'''''.  These '''''Binding Fields''''' extract against the whole document to detect periodic patterns that establish each section instance.


'''''Transaction Detection''''' is also a "Label Set aware" extraction technique.  It can use collected labels to more strictly define where each section instance is located.  For more information on Label Sets, visit the [[Label Sets]] article.
{| class="wikitable" style="margin:left"
! Previous Versions
|-
|
[[Transaction Detection - 2021|Grooper 2021]]
<br>
|}


<!--#region About-->
=== About ===
{|class="download-box"
 
{|cellpadding=10 cellspacing=5
|valign=top style="width:60%"|
The ''Transaction Detection'' section extraction method is useful for semi-structured documents containing multiple sections that are themselves highly structured, each repeating the same (or at least very similar) field or table data.
 
For example, take this monthly tax reporting form from the Oklahoma Tax Commission.
 
There are five sections of information on this document, listed as "A", "B", "C", "D" and "E".  Each of these sections collects the exact same set of data:
# A "Production Unit Number" assigned to an oil or natural gas well.
# A "Purchaser/Producers Report Number"
# The "Gross Volume" of oil or natural gas produced
# The "Gross Value" dollar amount of oil or natural gas produced
# The "Qualifying Tax Rate" ultimately used to calculate the tax due for the well's production.
# And so on.
 
The ''Transaction Detection'' method looks for ''periodic similarity'' (also referred to as "periodicity") to sub-divide a document into multiple section instances.
|
|
[[File:Data-section-divder-01.png]]
[[File:Asset 22@4x.png]]
|-
|valign=top|
For structured information like this, one way you can define where each section starts and stops is just by the patterns in the fields themselves.  These values are ''periodic''.  They appear at set intervals, marking the boundaries of each section.
 
For example,
# The "Production Unit Number" is always found at the start of the section.
# The "Exempt Volume" is always found somewhere in the middle of the section.
# The "Petroleum Excise Tax Due" is always found at the end.
 
The ''Transaction Detection'' method detects the periodic patterns in these values to divide the document into sections, forming one section instance from each periodic pattern of data detected.  Part of how ''Transaction Detection'' detects these patterns is by using extractors configured in the '''Data Section's''' child '''Data Field''' objects.  These are called '''''Binding Fields'''''.
 
'''Grooper''' uses the results matched by these '''Data Fields''' to detect each periodic section.  For example, you might have a "Production Unit Number" '''Data Field''' for these sections that returns five results, one for each section.  Once these five results are established, '''Grooper''' will look for other patterns around these results to establish the boundaries of each of the five sections.
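
The idea of anchoring section instances on '''''Binding Field''''' results can be illustrated with a minimal Python sketch.  This is not Grooper's implementation.  The function name <code>split_at_binding_hits</code> and the sample lines are hypothetical, and the sketch assumes each section simply runs from one binding hit to the line before the next, whereas the real method also weighs line similarity when setting boundaries.

```python
import re
from typing import List


def split_at_binding_hits(lines: List[str], binding_pattern: str) -> List[List[str]]:
    """Group lines into sections, starting a new section at each binding hit.

    Assumption for illustration only: a section runs from one hit to the
    line before the next hit (or to the end of the text for the last hit).
    """
    regex = re.compile(binding_pattern)
    # Indexes of the lines on which the binding pattern matched.
    starts = [i for i, line in enumerate(lines) if regex.search(line)]
    sections = []
    for n, start in enumerate(starts):
        end = starts[n + 1] if n + 1 < len(starts) else len(lines)
        sections.append(lines[start:end])
    return sections


# Hypothetical document text; the values are invented.
doc_lines = [
    "Header",
    "A 017-123456-0-000",
    "Gross Volume 100",
    "B 017-654321-1-001",
    "Gross Volume 200",
]
sections = split_at_binding_hits(doc_lines, r"\d{3}-\d{6}-\d-\d{3}")
print(len(sections))
```

Two binding hits produce two sections here, and the "Header" line before the first hit belongs to neither.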
|
|
[[File:Transaction-detection-about-01.png]]

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more '''Batches''' of sample documents.  The second contains one or more '''Projects''' with resources used in examples throughout this article.
* [[Media:2023 Wiki Transaction-Detection Batches.zip]]
* [[Media:2023 Wiki Transaction-Detection Project.zip]]
|-
|valign=top|
The ''Transaction Detection'' method also analyzes a document's text line-by-line looking for repeated lines that are highly similar to each other.
 
For example, the yellow highlighted lines are ''extremely'' similar.  They are essentially identical except for the starting character on each line (either "A", "B", "C", "D" or "E").  This repeated pattern is a good indication that we have a set of repeated (or "periodic") sections of information.
 
Furthermore, the next lines, highlighted in blue, are also similar as long as you normalize the data a bit.  If you replace the specific number with just a "#" symbol, they too are nearly identical.
 
The ''Transaction Detection'' method continues line-by-line, comparing the text on each line to subsequent lines and looking for repeating patterns.  Such is the case for the rest of the green highlighted lines.  Even accounting for OCR errors, each line is similar enough to detect a pattern.  We have 5 sets of very similar lines of text, so 5 section instances are ultimately returned for the '''Data Section'''.
 
Eventually, '''Grooper''' will detect a line that does ''not'' fit the pattern.  The red highlighted line is totally dissimilar from the set of similar lines detected previously.  Because it does not fit the periodic pattern, it marks the point where '''Grooper''' "knows" to stop.  This text is left out of the last section instance and, with no further lines matching the detected periodic pattern, no further section instances are returned.
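
The line-by-line comparison described above can be approximated with a short Python sketch.  It is only an illustration under stated assumptions, not Grooper's algorithm: digits are normalized to "#", similarity is measured with the standard library's <code>difflib.SequenceMatcher</code>, the period (number of lines per section) is assumed to be known in advance, and the 0.8 threshold and all sample lines are invented.

```python
import re
from difflib import SequenceMatcher
from typing import List


def normalize(line: str) -> str:
    """Replace every digit with '#' so lines differing only in numbers match."""
    return re.sub(r"\d", "#", line.strip())


def similarity(a: str, b: str) -> float:
    """Similarity ratio (0.0 to 1.0) between two normalized lines."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def count_periodic_sections(lines: List[str], period: int, threshold: float = 0.8) -> int:
    """Count consecutive blocks of `period` lines that resemble the first block.

    Counting stops at the first block that does not fit the pattern,
    mirroring the "stopping point" behaviour described in the text.
    """
    first = lines[:period]
    if len(first) < period:
        return 0
    count, pos = 1, period
    while pos + period <= len(lines):
        block = lines[pos:pos + period]
        if all(similarity(x, y) >= threshold for x, y in zip(first, block)):
            count += 1
            pos += period
        else:
            break
    return count


# Hypothetical text: two similar two-line sections, then unrelated lines.
sample_lines = [
    "A Production Unit Number 017-123456-0-000",
    "Gross Volume 1,250.00 Gross Value $9,880.00",
    "B Production Unit Number 017-654321-1-001",
    "Gross Volume 2,100.00 Gross Value $1,124.00",
    "Signature of taxpayer or authorized agent",
    "Date signed and telephone number",
]
section_count = count_periodic_sections(sample_lines, period=2)
print(section_count)
```

The last two lines fall below the threshold when compared with the first block, so they are excluded and counting stops after the second section.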
 
{|cellpadding="10" cellspacing="5"
|-style="background-color:#f89420; color:white"
|style="font-size:22pt"|'''&#9888;'''
|
The ''Transaction Detection'' method is not going to work well for every use case.  It succeeds best where most of the data in the section is numerical in nature.
 
It's easy to normalize numeric data.  Any individual digit can be swapped for a "#" symbol.  A currency value on a line of text in one section could be $988,000.00 and $112,433.00 in another, but when comparing the lines for periodic similarity (also referred to as "periodicity"), both can be normalized as "$###,###.##".  Lexical data tends to be trickier.  How do you normalize a name, for example?  How do you differentiate a name from a field label?  You can do it with a variety of extraction techniques, but ''not'' using this line-by-line approach to determining how similar one line is to another.
 
This is precisely why it's called "Transaction" Detection. It works best with transactional data, which tends to be currency, quantity or otherwise numerical values.  Indeed, this method was specifically designed for EOB (Explanation of Benefits) form processing and medical provider payment automation in general.
|}
|}
|
[[File:Transaction-detection-about-03.png]]
|}
<!--#endregion-->
<!--#region Configuring Transaction Detection with Binding Fields-->
== Configuring Transaction Detection with Binding Fields ==
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
Without utilizing Label Sets, the ''Transaction Detection'' sectioning method must assign ''at least one'' '''''Binding Field''''' in order to detect the periodic similarity among lines of text in a document, ultimately forming the '''Data Section's''' section instances.
# For this example, we will end up configuring the "Production" '''Data Section''' of this '''Data Model'''.
# We will utilize the "Production Unit Number" as the '''''Binding Field'''''.
# This '''Data Field''' utilizes a simple ''Pattern Match'' extractor for its '''''Value Extractor''''' assignment.
#* This extractor returns the production unit numbers on the document using a simple pattern: <code>\d{3}-\d{6}-\d-\d{3}</code>
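
To see what this pattern matches outside of Grooper, it can be tried in any regular expression engine.  The following Python sketch uses the pattern quoted above.  The sample text and its numbers are invented for illustration and are not taken from the sample documents.

```python
import re

# Pattern quoted in the article for production unit numbers:
# 3 digits, 6 digits, 1 digit and 3 digits, separated by hyphens.
PUN_PATTERN = re.compile(r"\d{3}-\d{6}-\d-\d{3}")

# Hypothetical sample text; the values are invented.
sample_text = """\
A 017-123456-0-000 12345 1,250.00
B 017-654321-1-001 12346 980.50
Total due 2,230.50
"""

matches = PUN_PATTERN.findall(sample_text)
print(matches)
```

Only the two hyphenated production unit numbers are returned.  The report numbers and currency values do not fit the pattern, which is what makes this value a reasonable anchor for each section.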
|
[[File:2023 Transaction Detection - 2023 01 About 01.png]]
|-
|valign=top|
# Importantly, notice this returns '''five''' result candidates (when testing extraction at the '''Data Field''' level in the Node Tree).
#* This is important because we want to end up creating five section instances.  If you expect to return five section instances, your '''''Binding Field's''''' extractor (or '''''Binding Fields'''''' extractors, if using more than one) will need to return five results.
|
[[File:2023 Transaction Detection - 2023 01 About 02.png]]
|-
|valign=top|
Next, we will configure the "Production" '''Data Section''' to create section instances using the ''Transaction Detection'' method.
# Select the '''Data Section''' in the Node Tree.
# Using the '''''Extract Method''''' property, select ''Transaction Detection''.
# Click the ellipsis button to the right of the '''''Binding Fields''''' property.
# This will bring up a new window. Select which '''Data Fields''' in the '''Data Section''' should be used as '''''Binding Fields''''' by checking the box next to the '''Data Field'''.
#* Here, we have selected the "Production Unit Number" '''Data Field'''.
{|cellpadding="10" cellspacing="5"
|-style="background-color:#f89420; color:white"
|style="font-size:22pt"|'''&#9888;'''
|
For this example, all we need to do is assign this single '''Data Field''' as a '''''Binding Field'''''.  There is enough similarity between the repeating sections that nothing more is required.  For more complicated situations, you may need multiple '''''Binding Fields'''''.  Just be sure ''all'' '''''Binding Fields''''' are present in ''each'' section.  Do not use "optional" '''Data Fields''' as '''''Binding Fields'''''.
|}
|
[[File:2023 Transaction Detection - 2023 01 About 03.png]]
|-
|valign=top|
The ''Transaction Detection'' method will then go through the line-by-line comparison process around the '''''Binding Fields''''' to detect periodic similarities to produce section instances.
# How '''Grooper''' goes about detecting these periodic patterns is controlled by the '''''Periodicity Detection''''' set of properties.
|
[[File:2023 Transaction Detection - 2023 01 About 04.png]]
|-
|valign=top|
# In our case, five section instances were established, one for each result from the "Production Unit Number" '''Data Field's''' '''''Value Extractor'''''.
# If you would like, feel free to click on the "Diagnostics" icon to open up the "Diagnostics" in a new tabbed window.
If you need to troubleshoot the ''Transaction Detection'' method's results, the "Diagnostics" can give you additional information as to how '''Grooper''' detected these repeating patterns in the document's text data.
|
[[File:2023 Transaction Detection - 2023 01 About 05.png]]
|-
|valign=top|
{|cellpadding="10" cellspacing="5"
|-style="background-color:#36b0a7; color:white"
|style="font-size:14pt"|'''FYI'''
|
:1. In "Diagnostics", you will find the following reports for the '''Data Section'''
:: 1. Execution Log
:: 2. Preprocessed Document
:: 3. Periodicity Matrix
|}
|
[[File:2023 Transaction Detection - 2023 01 About 06.png]]
|}
<!--#endregion-->
<!--#region Configuring Transaction Detection with Label Sets-->
{{#lst:Labeling Behavior - 2023|Transaction_Detection_Label_Sets}}

Latest revision as of 10:44, 22 November 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

