2023:Transaction Detection (Section Extract Method): Difference between revisions
No edit summary |
No edit summary |
||
| Line 82: | Line 82: | ||
Without utilizing Label Sets, the ''Transaction Detection'' sectioning method must assign ''at least one'' '''''Binding Field''''' in order to detect the periodic similarity among lines of text in a document, ultimately forming the '''Data Section's''' section instances. | Without utilizing Label Sets, the ''Transaction Detection'' sectioning method must assign ''at least one'' '''''Binding Field''''' in order to detect the periodic similarity among lines of text in a document, ultimately forming the '''Data Section's''' section instances. | ||
# For this example, we will end up configuring the "Production | # For this example, we will end up configuring the "Production" '''Data Section''' of this '''Data Model'''. | ||
# We will utilize the "Production Unit Number" as the '''''Binding Field'''''. | # We will utilize the "Production Unit Number" as the '''''Binding Field'''''. | ||
# This '''Data Field''' utilizes a simple ''Pattern Match'' extractor for its '''''Value Extractor''''' assignment. | # This '''Data Field''' utilizes a simple ''Pattern Match'' extractor for its '''''Value Extractor''''' assignment. | ||
| Line 96: | Line 96: | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
Next, we will configure the "Production | Next, we will configure the "Production" '''Data Section''' to create section instances using the ''Transaction Detection'' method. | ||
# Select the '''Data Section''' in the Node Tree. | # Select the '''Data Section''' in the Node Tree. | ||
# Using the '''''Extract Method''''' property, select ''Transaction Detection''. | # Using the '''''Extract Method''''' property, select ''Transaction Detection''. | ||
# | # Click the ellipsis button to the right of the '''''Binding Fields''''' property. | ||
# | # This will bring up a new window. Select which '''Data Fields''' in the '''Data Section''' should be used as '''''Binding Fields''''' by checking the box next to the '''Data Field'''. | ||
#* Here, we have selected the "Production Unit Number" '''Data Field'''. | #* Here, we have selected the "Production Unit Number" '''Data Field'''. | ||
| Line 122: | Line 122: | ||
|valign=top| | |valign=top| | ||
# In our case, five section instances were established, one for each result from the "Production Unit Number" '''Data Field's''' '''''Value Extractor'''''. | # In our case, five section instances were established, one for each result from the "Production Unit Number" '''Data Field's''' '''''Value Extractor'''''. | ||
# If you would like, feel free to click on the "Diagnostics" icon to open up the "Diagnostics" in a new tabbed window. | |||
If you need to troubleshoot the ''Transaction Detection'' method's results, the "Diagnostics" can give you additional information as to how '''Grooper''' detected these repeating patterns in the document's text data. | |||
| | | | ||
[[File:2023 Transaction Detection - 2023 01 About 05.png]] | [[File:2023 Transaction Detection - 2023 01 About 05.png]] | ||
| Line 130: | Line 133: | ||
|style="font-size:14pt"|'''FYI''' | |style="font-size:14pt"|'''FYI''' | ||
| | | | ||
:1. | |||
:1. In "Diagnostics", you will find the following reports for the '''Data Section''' | |||
:: 1. Execution Log | :: 1. Execution Log | ||
:: 2. Preprocessed Document | :: 2. Preprocessed Document | ||
Revision as of 09:19, 25 October 2023
|
WIP |
This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion. |
Transaction Detection is a Data Section Extract Method. This extraction method produces section instances by detecting repeating patterns of text around the Data Section's child Data Fields.
This Data Section Extract Method is unique in that it leverages the extracted values of the Data Section's child Data Fields to produce the section instances. Normally, the section instance is established first, and then the Data Fields' extractors return results from the text data in that instance. In the case of Transaction Detection, one or more Data Fields are specified as "Binding Fields". These Binding Fields extract against the whole document, and their results are used to detect the periodic patterns that establish each section instance.
Transaction Detection is also a "Label Set aware" extraction technique. It can use collected labels to more strictly define where each section instance is located. For more information on Label Sets, visit the Label Sets article.
About
|
The Transaction Detection section extraction method is useful for semi-structured documents that contain multiple sections which are themselves very structured, repeating the same (or at least very similar) field or table data. For example, take this monthly tax reporting form from the Oklahoma Tax Commission. There are five sections of information on this document, listed as "A", "B", "C", "D" and "E". Each of these sections collects the exact same set of data:
The Transaction Detection method looks for periodic similarity (also referred to as "periodicity") to sub-divide a document into multiple section instances. |
|||
|
For structured information like this, one way you can define where each section starts and stops is just by the patterns in the fields themselves. These values are periodic. They appear at set intervals, marking the boundaries of each section. For example,
The Transaction Detection method detects the periodic patterns in these values to divide the document into sections, forming one section instance from each periodic pattern of data detected. Part of how Transaction Detection detects these patterns is by using extractors configured in the Data Section's child Data Field objects. These are called Binding Fields. Grooper uses the results matched by these Data Fields to detect each periodic section. For example, you might have a "Production Unit Number" Data Field for these sections that returns five results, one for each section. Once these five results are established, Grooper will look for other patterns around these results to establish the boundaries of each of the five sections. |
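The role of a Binding Field can be illustrated with a small Python sketch. This is not Grooper's implementation. It only assumes that a Binding Field behaves like a pattern run against the whole document, and that each match anchors one section. The pattern `BINDING_PATTERN`, the helper names, and the sample text are all hypothetical, since the real "Production Unit Number" format is not given in this article.

```python
import re

# Hypothetical stand-in for a Binding Field's Value Extractor.
# The actual "Production Unit Number" format is an assumption here.
BINDING_PATTERN = re.compile(r"\b\d{3}-\d{6}\b")

def find_anchor_lines(lines):
    """Return the indexes of lines where the binding pattern matches."""
    return [i for i, line in enumerate(lines) if BINDING_PATTERN.search(line)]

def split_at_anchors(lines):
    """Naively start a new section at each anchor line.

    The last section simply runs to the end of the text; this sketch
    does not try to find a stopping point.
    """
    anchors = find_anchor_lines(lines)
    sections = []
    for n, start in enumerate(anchors):
        end = anchors[n + 1] if n + 1 < len(anchors) else len(lines)
        sections.append(lines[start:end])
    return sections

# Made-up sample text with two repeating sections.
sample = [
    "Monthly Report",
    "A  Unit 043-000123  Gross 1,200",
    "   Exempt 100  Net 1,100",
    "B  Unit 043-000456  Gross 900",
    "   Exempt 0  Net 900",
]
sections = split_at_anchors(sample)
print(len(sections))
```

In this sketch each match only marks where a section begins. The actual method goes further, using the patterns around each Binding Field result to decide where every section starts and stops.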
|||
|
The Transaction Detection method also analyzes a document's text line by line, looking for repeated lines that are highly similar to each other. For example, the yellow highlighted lines are extremely similar. They are essentially identical except for the starting character on each line (either "A", "B", "C", "D" or "E"). This repeated pattern is a good indication that we have a set of repeated (or "periodic") sections of information. Furthermore, the next lines, highlighted in blue, are also similar as long as you normalize the data a bit. If you replace each specific number with a "#" symbol, they too are nearly identical. The Transaction Detection method continues line by line, comparing the text on each line to subsequent lines and looking for repeating patterns. Such is the case for the rest of the green highlighted lines. Even accounting for OCR errors, each line is similar enough to detect a pattern. We have five sets of very similar lines of text, so five section instances are ultimately returned for the Data Section. Eventually, Grooper will detect a line that does not fit the pattern. The red highlighted line is totally dissimilar from the set of similar lines detected previously. Because it does not fit the periodic pattern, it marks a stopping point, and this is where Grooper "knows" to stop. This text is left out of the last section instance and, with no further lines matching the detected periodic pattern, no further section instances are returned.
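The line-by-line comparison described above can be sketched in Python. This is an illustration only, not how Grooper computes similarity. It assumes numbers are normalized to "#" and uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure. The `THRESHOLD` value, the function names, and the sample lines are hypothetical.

```python
import re
from difflib import SequenceMatcher

THRESHOLD = 0.8  # hypothetical cutoff, chosen only for this sketch

def normalize(line):
    """Replace each number with '#' and collapse runs of whitespace."""
    line = re.sub(r"\d[\d,.]*", "#", line)
    return " ".join(line.split())

def similarity(a, b):
    """Similarity ratio (0.0 to 1.0) between two normalized lines."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def count_periodic_lines(first, candidates):
    """Count lines resembling `first`, stopping at the first that does not."""
    count = 1
    for line in candidates:
        if similarity(first, line) < THRESHOLD:
            break  # dissimilar line: the periodic pattern ends here
        count += 1
    return count

# Made-up sample lines; the third contains a simulated OCR error.
lines = [
    "A Gross Value 1,200.00 Exempt 100.00",
    "B Gross Value 950.50 Exempt 0.00",
    "C Gross Va1ue 20.00 Exempt 5.00",
    "Signature of taxpayer and date",
]
print(count_periodic_lines(lines[0], lines[1:]))
```

The first three lines differ only in their leading letter, their numbers, and one OCR error, so they stay above the cutoff after normalization. The final line is dissimilar and ends the run, which mirrors the stopping point described for the red highlighted line.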
|
Configuring Transaction Detection with Binding Fields
|
Without utilizing Label Sets, the Transaction Detection sectioning method must be assigned at least one Binding Field in order to detect the periodic similarity among lines of text in a document, ultimately forming the Data Section's section instances.
|
|||
|
|||
|
Next, we will configure the "Production" Data Section to create section instances using the Transaction Detection method.
|
|||
|
The Transaction Detection method will then go through the line-by-line comparison process around the Binding Fields to detect periodic similarities to produce section instances.
|
|||
If you need to troubleshoot the Transaction Detection method's results, the "Diagnostics" can give you additional information as to how Grooper detected these repeating patterns in the document's text data. |
|||
|








