2021:Transaction Detection (Section Extract Method)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Transaction Detection is a Data Section Extract Method. This extraction method produces section instances by detecting repeating patterns of text around the Data Section's child Data Fields.

This Data Section Extract Method is unique in that it leverages the extracted values of the Data Section's child Data Fields to produce the section instances. Normally, the section instance is established first, then the Data Fields' extractors return results from the text data in the instance. In the case of Transaction Detection, one or more Data Fields are specified as "Binding Fields". These Binding Fields extract against the whole document to detect the periodic patterns that establish each section instance.

Transaction Detection is also a "Label Set aware" extraction technique. It can use collected labels to more strictly define where each section instance is located. For more information on Label Sets, visit the Label Sets article.

About

The Transaction Detection section extraction method is useful for semi-structured documents containing multiple sections that are themselves highly structured, repeating the same (or at least very similar) field or table data.

For example, take this monthly tax reporting form from the Oklahoma Tax Commission.

There are five sections of information on this document, listed as "A", "B", "C", "D" and "E". Each of these sections collects the exact same set of data:

  1. A "Production Unit Number" assigned to an oil or natural gas well.
  2. A "Purchaser/Producers Report Number".
  3. The "Gross Volume" of oil or natural gas produced.
  4. The "Gross Value" dollar amount of oil or natural gas produced.
  5. The "Qualifying Tax Rate" ultimately used to calculate the tax due for the well's production.
  6. And so on.

The Transaction Detection method looks for periodic similarity (also referred to as "periodicity") to sub-divide a document into multiple section instances.

For structured information like this, one way you can define where each section starts and stops is just by the patterns in the fields themselves. These values are periodic. They appear at set intervals, marking the boundaries of each section.

For example,

  1. The "Production Unit Number" is always found at the start of the section.
  2. The "Exempt Volume" is always found somewhere in the middle of the section.
  3. The "Petroleum Excise Tax Due" is always found at the end.

The Transaction Detection method detects the periodic patterns in these values to divide the document into sections, forming one section instance from each periodic pattern of data detected. It detects these patterns in part by using the extractors configured on the Data Section's child Data Field objects. The Data Fields used this way are called Binding Fields.

Grooper uses the results matched by these Data Fields to detect each periodic section. For example, you might have a "Production Unit Number" Data Field for these sections that returns five results, one for each section. Once these five results are established, Grooper will look for other patterns around these results to establish the boundaries of each of the five sections.
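The idea of values appearing "at set intervals" can be illustrated with a minimal Python sketch. This is not Grooper's implementation. The helper names `binding_hit_lines` and `dominant_period`, and the sample text, are hypothetical and exist only for illustration. The sketch finds the lines where a binding pattern matches and reports the most common gap between consecutive hits.

```python
import re
from collections import Counter

def binding_hit_lines(lines, pattern):
    """Return the index of every line on which the binding pattern matches."""
    return [i for i, line in enumerate(lines) if re.search(pattern, line)]

def dominant_period(hit_lines):
    """Return the most common gap (in lines) between consecutive hits, or None."""
    gaps = [b - a for a, b in zip(hit_lines, hit_lines[1:])]
    if not gaps:
        return None
    return Counter(gaps).most_common(1)[0][0]

# Made-up sample text: five sections of three lines each.
unit_numbers = ["035-012345-0-000", "047-067890-1-000", "051-024680-0-000",
                "063-013579-2-000", "073-098765-0-000"]
lines = []
for number in unit_numbers:
    lines.append(f"Production Unit Number {number}")
    lines.append("Exempt Volume 1,250.00")
    lines.append("Petroleum Excise Tax Due $42.17")

hits = binding_hit_lines(lines, r"\d{3}-\d{6}-\d-\d{3}")
period = dominant_period(hits)
print(hits, period)
```

In this toy data the binding value recurs every three lines, so each hit can be treated as the start of one section.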

The Transaction Detection method also analyzes a document's text line-by-line looking for repeated lines that are highly similar to each other.

For example, the yellow highlighted lines are extremely similar to one another. They are essentially identical except for the starting character on each line (either "A", "B", "C", "D" or "E"). This repeated pattern is a good indication that we have a set of repeated (or "periodic") sections of information.

Furthermore, the next lines, highlighted in blue, are also similar as long as you normalize the data a bit. If you replace the specific number with just a "#" symbol, they too are nearly identical.

The Transaction Detection method will continue line-by-line, comparing the text on each line to subsequent lines and looking for repeating patterns. Such is the case for the rest of the green highlighted lines. Even accounting for OCR errors, each line is similar enough to detect a pattern. We have five sets of very similar lines of text, so five section instances are ultimately returned for the Data Section.

Eventually, Grooper will detect a line that does not fit the pattern. The red highlighted line is totally dissimilar from the set of similar lines detected previously. This is where Grooper "knows" to stop. Because the line does not fit the periodic pattern, it marks a stopping point. This text is left out of the last section instance and, with no further lines matching the detected periodic pattern, no further section instances are returned.
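The line-by-line comparison described above can be sketched in Python. This is only an illustration and not Grooper's algorithm. It assumes digits are replaced with "#" before comparison, uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, and simplifies the problem to a list holding one candidate line per section. The function names, the 0.8 threshold and the sample lines are all hypothetical.

```python
import re
from difflib import SequenceMatcher

def normalize(line: str) -> str:
    """Replace every digit with '#' so specific numbers do not affect similarity."""
    return re.sub(r"\d", "#", line)

def similarity(a: str, b: str) -> float:
    """Similarity ratio (0.0 to 1.0) between two lines after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def count_periodic_lines(lines, threshold=0.8):
    """Count leading lines that resemble the first line; stop at the first that does not."""
    count = 0
    for line in lines:
        if similarity(lines[0], line) < threshold:
            break  # the pattern is broken, so this line marks the stopping point
        count += 1
    return count

# Made-up lines: five similar section lines followed by one dissimilar line.
lines = [
    "A 8. Production Unit Number 035-012345-0-000",
    "B 8. Production Unit Number 047-067890-1-000",
    "C 8. Production Unit Number 051-024680-0-000",
    "D 8. Production Unit Number 063-013579-2-000",
    "E 8. Production Unit Number 073-098765-0-000",
    "Signature of taxpayer or authorized agent",
]
section_count = count_periodic_lines(lines)
print(section_count)
```

The first five lines differ only in their leading letter once digits are normalized, so they pass the threshold. The signature line does not, and the count stops there.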

The Transaction Detection method is not going to work well for every use case. It succeeds best where most of the data in the section is numerical in nature.

It's easy to normalize numeric data. Each digit can be swapped for a "#" symbol. A currency value on a line of text in one section could be $988,000.00 and $112,433.00 in another, but when comparing the lines for periodic similarity, both can be normalized as "$###,###.##". Lexical data tends to be trickier. How do you normalize a name, for example? How do you differentiate a name from a field label? You can do it with a variety of extraction techniques, but not with this line-by-line approach to determining how similar one line is to another.
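As a small illustration of this kind of numeric normalization, the Python sketch below replaces every digit with "#". It assumes a per-digit replacement, consistent with the "$###,###.##" example. The function name `normalize_line` and the "Gross Value" line text are hypothetical, and this is not how Grooper itself is implemented.

```python
import re

def normalize_line(line: str) -> str:
    """Replace every digit with '#' so lines that differ only in numbers compare equal."""
    return re.sub(r"\d", "#", line)

first = normalize_line("Gross Value $988,000.00")
second = normalize_line("Gross Value $112,433.00")

print(first)
print(first == second)
```

Both lines normalize to the same string, so a line-by-line comparison treats them as identical even though the amounts differ.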

This is precisely why it's called "Transaction" Detection. It works best with transactional data, which tends to consist of currency, quantity or otherwise numerical values. Indeed, this method was specifically designed for EOB (Explanation of Benefits) form processing and medical provider payment automation in general.

Configuring Transaction Detection with Binding Fields

Without Label Sets, the Transaction Detection sectioning method requires at least one Binding Field in order to detect the periodic similarity among lines of text in a document, ultimately forming the Data Section's section instances.

  1. For this example, we will end up configuring the "Production Info" Data Section of this Data Model.
  2. We will utilize the "Production Unit Number" as the Binding Field.
  3. This Data Field uses a simple Pattern Match extractor for its Value Extractor.
    • It returns the production unit numbers on the document using the pattern \d{3}-\d{6}-\d-\d{3}
  4. Importantly, notice this returns five result candidates (when testing extraction at the Data Field level in the Node Tree).
    • This matters because we want to end up with five section instances. If you expect five section instances, your Binding Field's extractor (or your Binding Fields' extractors, if using more than one) will need to return five results.
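Outside of Grooper, the behavior of this pattern can be checked with a quick Python sketch. The sample text and the production unit numbers in it are made up for illustration. Only the regular expression itself comes from the example above.

```python
import re

# The pattern used by the "Production Unit Number" Data Field in this example.
PRODUCTION_UNIT_NUMBER = re.compile(r"\d{3}-\d{6}-\d-\d{3}")

# Made-up text standing in for the document's five sections.
sample_text = """\
A 8. Production Unit Number 035-012345-0-000
B 8. Production Unit Number 047-067890-1-000
C 8. Production Unit Number 051-024680-0-000
D 8. Production Unit Number 063-013579-2-000
E 8. Production Unit Number 073-098765-0-000
"""

matches = PRODUCTION_UNIT_NUMBER.findall(sample_text)
print(len(matches), matches)
```

Five matches here corresponds to the five result candidates expected from the Binding Field, one per section.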

Next, we will configure the "Production Info" Data Section to create section instances using the Transaction Detection method.

  1. Select the Data Section in the Node Tree.
  2. Using the Extract Method property, select Transaction Detection.
  3. Select the Binding Fields property.
  4. Using the dropdown menu, select which Data Fields in the Data Section should be used as Binding Fields by checking the box next to the Data Field.
    • Here, we have selected the "Production Unit Number" Data Field.

For this example, all we need to do is assign this single Data Field as a Binding Field. There is enough similarity between the repeating sections that nothing more is required. For more complicated situations, you may need multiple Binding Fields. Just be sure all Binding Fields are present in each section. Do not use "optional" Data Fields as Binding Fields.

The Transaction Detection method will then go through the line-by-line comparison process around the Binding Fields to detect periodic similarities to produce section instances.

  1. How Grooper goes about detecting these periodic patterns is controlled by the Periodicity Detection set of properties.
  2. In our case, five section instances were established, one for each result from the "Production Unit Number" Data Field's Value Extractor.

FYI

  1. If you need to troubleshoot the Transaction Detection method's results, the "Diagnostics" tab can give you additional information about how Grooper detected these repeating patterns in the document's text data.
  2. You will find the following reports for the Data Section:
    • Execution Log
    • Preprocessed Document
    • Labels
    • Periodicity Matrix

Configuring Transaction Detection with Label Sets

Now that we understand the basics of the Transaction Detection method, we can look at how this sectioning method interacts with the Labeling Behavior. Its behavior is wildly different if a Header label is collected for the Data Section. Assuming you can collect a Header label for the Data Section, it is so different that a Binding Field is not even necessary to produce the section instances.

Establishing the section instances is almost as simple as...

  1. Start the section instance at the Header label.
  2. Stop the section instance at the next Header label (or Footer label).
  3. Repeat for every Header label found on the document.
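The three steps above can be sketched in Python as a simple line splitter. This is an illustration of the idea only, not Grooper's implementation. The function name `split_on_header`, the substring matching of labels, the footer text and the sample lines are all hypothetical.

```python
def split_on_header(lines, header, footer=None):
    """Split lines into sections that each start at a line containing the header.

    A section ends just before the next header line, or at a footer line if
    one is given. Lines outside any section are ignored.
    """
    sections = []
    current = None
    for line in lines:
        if footer is not None and footer in line:
            if current is not None:
                sections.append(current)
                current = None
            continue  # the footer line itself is not part of any section
        if header in line:
            if current is not None:
                sections.append(current)
            current = [line]  # start a new section at the header line
        elif current is not None:
            current.append(line)
    if current is not None:
        sections.append(current)
    return sections

# Made-up document text: a title line, five sections and a footer.
header_label = "8. Production Unit Number"
lines = ["Monthly Tax Report"]
for letter in "ABCDE":
    lines.append(f"{letter} {header_label} 9. Purchasers/Producers Report Number")
    lines.append("035-012345-0-000 1234567")
lines.append("Signature of taxpayer")
lines.append("Date signed")

sections = split_on_header(lines, header_label, footer="Signature of taxpayer")
print(len(sections))
```

Each returned section begins on a header line and ends on the line before the next header, with the final section closed by the footer.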

For example, we have collected a Header label for the "Production Info" Data Section here.

  1. To add the label, we've selected the Content Model in the Node Tree.
  2. We've navigated to the "Labels" tab.
  3. We've selected the document in the Batch classified as the "OTC Form 300" Document Type.
    • In other words, this is the Label Set for the "OTC Form 300" Document Type.
  4. We've selected the Data Section in the Data Model.
  5. For the Header label, we've captured the first line of field labels.
    • 8. Production Unit Number 9. Purchasers/Producers Report Number 10. Gross Volume 11. Gross Value
  6. Notice we have five hits for this label, one at the start of each section.

Next, we will configure the "Production Info" Data Section to create section instances using the Transaction Detection method.

  1. Select the Data Section in the Node Tree.
  2. Using the Extract Method property, select Transaction Detection.
  3. Notice no Binding Fields are selected.
  4. But we still get the five section instances returned!
    • In fact, for this example, no further configuration was required other than collecting the Data Section's Header label and setting the Extract Method to Transaction Detection.
  5. The section instance starts on the line containing the Header label.
  6. And it ends on the line before the next Header label.
    • Then the second section instance starts at the second header and so on.