2023:Transaction Detection (Section Extract Method): Difference between revisions
Created page with "{|class="wip-box" | '''WIP''' | This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion. |} <blockquote style="font-size:14pt"> '''''Transaction Detection''''' is a '''Data Section''' '''''Extract Method'''''. This extraction method produces section instances by detecting repeating pat..." |
No edit summary |
||
| Line 15: | Line 15: | ||
'''''Transaction Detection''''' is also a "Label Set aware" extraction technique. It can use collected labels to more strictly define where each section instance is located. For more information on Label Sets, visit the [[Label Sets]] article. | '''''Transaction Detection''''' is also a "Label Set aware" extraction technique. It can use collected labels to more strictly define where each section instance is located. For more information on Label Sets, visit the [[Label Sets]] article. | ||
<!--#region About--> | |||
=== About === | === About === | ||
| Line 74: | Line 74: | ||
[[File:Transaction-detection-about-03.png]] | [[File:Transaction-detection-about-03.png]] | ||
|} | |} | ||
<!--#endregion--> | |||
<!--#region Configuring Transaction Detection with Binding Fields--> | |||
== Configuring Transaction Detection with Binding Fields == | == Configuring Transaction Detection with Binding Fields == | ||
| Line 85: | Line 86: | ||
# This '''Data Field''' utilizes a simple ''Pattern Match'' extractor for its '''''Value Extractor''''' assignment. | # This '''Data Field''' utilizes a simple ''Pattern Match'' extractor for its '''''Value Extractor''''' assignment. | ||
#* Which returns the production unit numbers on the document using a simple pattern <code>\d{3}-\d{6}-\d-\d{3}</code> | #* Which returns the production unit numbers on the document using a simple pattern <code>\d{3}-\d{6}-\d-\d{3}</code> | ||
| | |||
[[File:2023 Transaction Detection - 2023 01 About 01.png]] | |||
|- | |||
|valign=top| | |||
# Importantly, notice this returns '''five''' result candidates (when testing extraction at the '''Data Field''' level in the Node Tree). | # Importantly, notice this returns '''five''' result candidates (when testing extraction at the '''Data Field''' level in the Node Tree). | ||
#* This will be important because we want to end up creating five section instances. If you expect to return five section instances, your '''''Binding Field's''''' extractor (or '''''Binding Fields''''' extractors if using more than one) will need to return five results. | #* This will be important because we want to end up creating five section instances. If you expect to return five section instances, your '''''Binding Field's''''' extractor (or '''''Binding Fields''''' extractors if using more than one) will need to return five results. | ||
| | | | ||
[[File:Transaction- | [[File:2023 Transaction Detection - 2023 01 About 02.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 106: | Line 111: | ||
|} | |} | ||
| | | | ||
[[File:Transaction- | [[File:2023 Transaction Detection - 2023 01 About 03.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 112: | Line 117: | ||
# How '''Grooper''' goes about detecting these periodic patterns is controlled by the '''''Periodicity Detection''''' set of properties. | # How '''Grooper''' goes about detecting these periodic patterns is controlled by the '''''Periodicity Detection''''' set of properties. | ||
| | |||
[[File:2023 Transaction Detection - 2023 01 About 04.png]] | |||
|- | |||
|valign=top| | |||
# In our case, five section instances were established, one for each result from the "Production Unit Number" '''Data Field's''' '''''Value Extractor'''''. | # In our case, five section instances were established, one for each result from the "Production Unit Number" '''Data Field's''' '''''Value Extractor'''''. | ||
| | | | ||
[[File:Transaction- | [[File:2023 Transaction Detection - 2023 01 About 05.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 122: | Line 131: | ||
| | | | ||
:1. If you need to troubleshoot the ''Transaction Detection'' method's results, the "Diagnostics" tab can give you additional information as to how '''Grooper''' detected these repeating patterns in the document's text data. | :1. If you need to troubleshoot the ''Transaction Detection'' method's results, the "Diagnostics" tab can give you additional information as to how '''Grooper''' detected these repeating patterns in the document's text data. | ||
: | :1. You will find the following reports for the '''Data Section''' | ||
:: 1. Execution Log | :: 1. Execution Log | ||
:: 2. Preprocessed Document | :: 2. Preprocessed Document | ||
:: 3 | :: 3. Periodicity Matrix | ||
|} | |} | ||
| | | | ||
[[File:Transaction- | [[File:2023 Transaction Detection - 2023 01 About 06.png]] | ||
|} | |} | ||
<!--#endregion--> | |||
<!--#region Configuring Transaction Detection with Label Sets--> | |||
<!-- | <!-- | ||
== Configuring Transaction Detection with Label Sets == | == Configuring Transaction Detection with Label Sets == | ||
| Line 174: | Line 183: | ||
[[Category:Articles]] | [[Category:Articles]] | ||
[[Category:Version 2021]] | [[Category:Version 2021]] | ||
<!--#endregion--> | |||
Revision as of 09:14, 25 October 2023
|
WIP |
This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion. |
Transaction Detection is a Data Section Extract Method. This extraction method produces section instances by detecting repeating patterns of text around the Data Section's child Data Fields.
This Data Section Extract Method is unique in that it leverages the extracted values of the Data Section's child Data Fields to produce the section instances. Normally, the section instance is established first, then the Data Fields' extractors return results from the text data in the instance. With Transaction Detection, one or more Data Fields are specified as "Binding Fields". These Binding Fields extract against the whole document, and Grooper uses their results to detect the periodic patterns that establish each section instance.
Transaction Detection is also a "Label Set aware" extraction technique. It can use collected labels to more strictly define where each section instance is located. For more information on Label Sets, visit the Label Sets article.
About
|
The Transaction Detection section extraction method is useful for semi-structured documents that contain multiple sections which are themselves highly structured, repeating the same (or at least very similar) field or table data. For example, take this monthly tax reporting form from the Oklahoma Tax Commission. There are five sections of information on this document, listed as "A", "B", "C", "D" and "E". Each of these sections collects the exact same set of data:
The Transaction Detection method looks for periodic similarity (also referred to as "periodicity") to sub-divide a document into multiple section instances. |
|||
|
For structured information like this, one way you can define where each section starts and stops is just by the patterns in the fields themselves. These values are periodic. They appear at set intervals, marking the boundaries of each section. For example,
The Transaction Detection method detects the periodic patterns in these values to divide the document into sections, forming one section instance from each periodic pattern of data detected. Part of how the Transaction Detection method detects these patterns is by using extractors configured in the Data Section's child Data Field objects. These are called Binding Fields. Grooper uses the results matched by these Data Fields to detect each periodic section. For example, you might have a "Production Unit Number" Data Field for these sections that returns five results, one for each section. Once these five results are established, Grooper will look for other patterns around these results to establish the boundaries of each of the five sections. |
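The Binding Field idea can be sketched outside of Grooper as a pattern run against the whole document, with each hit becoming an anchor for one section instance. This is only an illustrative sketch, not Grooper's implementation: the pattern is the one the article uses for the "Production Unit Number" Data Field, while the function name `find_binding_anchors` and the sample lines (including every value in them) are hypothetical.

```python
import re

# Pattern from the article's "Production Unit Number" Data Field.
PRODUCTION_UNIT_NUMBER = re.compile(r"\d{3}-\d{6}-\d-\d{3}")


def find_binding_anchors(lines):
    """Return (line_index, matched_text) for every Binding Field hit.

    Hypothetical helper: the extractor runs against the whole document,
    not against a section instance that already exists.
    """
    anchors = []
    for index, line in enumerate(lines):
        for match in PRODUCTION_UNIT_NUMBER.finditer(line):
            anchors.append((index, match.group()))
    return anchors


# Invented sample text standing in for a document's OCR output.
SAMPLE_LINES = [
    "MONTHLY REPORT",
    "A. Production Unit Number 123-456789-0-001",
    "   Gross Volume 1,200",
    "B. Production Unit Number 123-456790-0-002",
    "   Gross Volume 980",
    "TOTAL DUE 55.10",
]

anchors = find_binding_anchors(SAMPLE_LINES)
print(anchors)
```

In this sample the pattern matches on two lines, so two section instances would be expected. On the article's tax form the same check should yield five anchors, one per section.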
|||
|
The Transaction Detection method also analyzes a document's text line by line, looking for repeated lines that are highly similar to each other. For example, each of the yellow highlighted lines is extremely similar. They are essentially identical except for the starting character on each line (either "A", "B", "C", "D" or "E"). This repeated pattern is a good indication that we have a set of repeated (or "periodic") sections of information. Furthermore, the next lines, highlighted in blue, are also similar as long as you normalize the data a bit. If you replace each specific number with just a "#" symbol, they too are nearly identical. The Transaction Detection method continues line by line, comparing the text on each line to subsequent lines and looking for repeating patterns. Such is the case for the rest of the green highlighted lines. Even accounting for OCR errors, each line is similar enough to detect a pattern. We have five sets of very similar lines of text, so five section instances are ultimately returned for the Data Section. Eventually, Grooper will detect a line that does not fit the pattern. The red highlighted line is totally dissimilar from the set of similar lines detected previously. Because it does not fit the periodic pattern, it marks a stopping point, and this is where Grooper "knows" to stop. This text is left out of the last section instance and, with no further lines matching the detected periodic pattern, no further section instances are returned.
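The normalize-then-compare step described above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not Grooper's algorithm: it swaps in `difflib.SequenceMatcher` from the standard library as the similarity measure, and the helper names `normalize` and `similarity` and all sample lines are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Replace each number (digits plus , . - separators) with '#'."""
    return re.sub(r"\d[\d,.\-]*", "#", line.strip())


def similarity(first, second):
    """Similarity in [0, 1] between two lines after normalization."""
    return SequenceMatcher(None, normalize(first), normalize(second)).ratio()


# Invented sample lines; only the leading letter and the numbers differ.
line_a = "A. Production Unit Number 123-456789-0-001"
line_b = "B. Production Unit Number 987-654321-1-002"
footer = "TOTAL DUE 55.10"

print(normalize(line_a))
print(similarity(line_a, line_b))
print(similarity(line_a, footer))
```

The two section lines score close to 1 because they differ only in their first character once numbers are replaced, while the footer line scores low, which is the kind of signal that would mark a stopping point. A ratio-based measure like this also tolerates small OCR errors, since a few wrong characters lower the score only slightly.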
|
Configuring Transaction Detection with Binding Fields
|
Without Label Sets, the Transaction Detection sectioning method requires at least one assigned Binding Field in order to detect the periodic similarity among lines of text in a document, ultimately forming the Data Section's section instances.
|
|||
|
|||
|
Next, we will configure the "Production Info" Data Section to create section instances using the Transaction Detection method.
|
|||
|
The Transaction Detection method will then go through the line-by-line comparison process around the Binding Fields to detect periodic similarities to produce section instances.
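As a rough illustration of how anchors and line comparison could combine into section boundaries, the sketch below starts one section at each Binding Field line and ends the last section at the first line that no longer resembles the first section. Everything here is an assumption made for illustration: the function `section_spans`, the 0.8 threshold, the use of `difflib.SequenceMatcher`, and the sample lines are hypothetical and do not describe Grooper's actual '''Periodicity Detection''' logic.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Replace each number with '#' so only the line's shape is compared."""
    return re.sub(r"\d[\d,.\-]*", "#", line.strip())


def similar(first, second, threshold=0.8):
    """True when two normalized lines are at least `threshold` alike."""
    ratio = SequenceMatcher(None, normalize(first), normalize(second)).ratio()
    return ratio >= threshold


def section_spans(lines, anchor_rows, threshold=0.8):
    """Return one (start, end) line range per anchor, end exclusive."""
    spans = []
    first_start = anchor_rows[0] if anchor_rows else 0
    # Length of the first section, used as the expected period.
    period = anchor_rows[1] - first_start if len(anchor_rows) > 1 else 1
    for position, start in enumerate(anchor_rows):
        if position + 1 < len(anchor_rows):
            # A section runs until the next Binding Field line.
            end = anchor_rows[position + 1]
        else:
            # Last section: extend while each line resembles the line at
            # the same offset in the first section, up to one period.
            end = start
            while (
                end < len(lines)
                and end - start < period
                and similar(lines[end], lines[first_start + end - start], threshold)
            ):
                end += 1
        spans.append((start, end))
    return spans


# Invented sample text; rows 1 and 3 hold the Binding Field matches.
SAMPLE_LINES = [
    "MONTHLY REPORT",
    "A. Production Unit Number 123-456789-0-001",
    "   Gross Volume 1,200",
    "B. Production Unit Number 123-456790-0-002",
    "   Gross Volume 980",
    "TOTAL DUE 55.10",
]

spans = section_spans(SAMPLE_LINES, [1, 3])
print(spans)
```

Here the header line before the first anchor and the footer line after the last section fall outside every span, mirroring how text that does not fit the periodic pattern is left out of the section instances.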
|
|||
|
|||
|








