2023:Transaction Detection (Section Extract Method)

From Grooper Wiki
Revision as of 10:33, 27 August 2024 by Randallkinard (talk | contribs)

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

Transaction Detection is a Data Section Extract Method. This extraction method produces section instances by detecting repeating patterns of text around the Data Section's child Data Fields.

This Data Section Extract Method is unique in that it leverages the Data Section's child Data Fields' extracted values to produce the section instances. Normally, the section instance is established first, then the Data Fields' extractors return results from the text data in the instance. In the case of Transaction Detection, one or more Data Fields are specified as "Binding Fields". These Binding Fields extract against the whole document to detect periodic patterns to establish each section instance.

Transaction Detection is also a "Label Set aware" extraction technique. It can use collected labels to more strictly define where each section instance is located. For more information on Label Sets, visit the Label Sets article.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About Transaction Detection

The Transaction Detection section extraction method is useful for semi-structured documents which have multiple sections which are themselves very structured, repeating the same (or at least very similar) field or table data.

For example, take this monthly tax reporting form from the Oklahoma Tax Commission.

There are five sections of information on this document, listed as "A", "B", "C", "D" and "E". Each of these sections collects the exact same set of data:

  1. A "Production Unit Number" assigned to an oil or natural gas well.
  2. A "Purchaser/Producers Report Number"
  3. The "Gross Volume" of oil or natural gas produced
  4. The "Gross Value" dollar amount of oil or natural gas produced
  5. The "Qualifying Tax Rate" ultimately used to calculate the tax due for the well's production.
  6. And so on.

The Transaction Detection method looks for periodic similarity (also referred to as "periodicity") to sub-divide a document into multiple section instances.

For structured information like this, one way you can define where each section starts and stops is just by the patterns in the fields themselves. These values are periodic. They appear at set intervals, marking the boundaries of each section.

For example,

  1. The "Production Unit Number" is always found at the start of the section.
  2. The "Exempt Volume" is always found somewhere in the middle of the section.
  3. The "Petroleum Excise Tax Due" is always found at the end.

The Transaction Detection method detects the periodic patterns in these values to divide up the document into sections, forming one section instance from each periodic pattern of data detected. Part of how the Transaction Detection method detects these patterns is by using extractors configured in the Data Section's child Data Field objects. The Data Fields used this way are called Binding Fields.

Grooper uses the results matched by these Data Fields to detect each periodic section. For example, you might have a "Production Unit Number" Data Field in this Data Section that returns five results, one for each section. Once these five results are established, Grooper will look for other patterns around these results to establish the boundaries of each of the five sections.

The Transaction Detection method also analyzes a document's text line-by-line looking for repeated lines that are highly similar to each other.

For example, the yellow highlighted lines are extremely similar to each other. They are essentially identical except for the starting character on each line (either "A", "B", "C", "D" or "E"). This repeated pattern is a good indication that we have a set of repeated (or "periodic") sections of information.

Furthermore, the next lines, highlighted in blue, are also similar as long as you normalize the data a bit. If you replace the specific number with just a "#" symbol, they too are nearly identical.

The Transaction Detection method will further go line-by-line comparing the text on each one to subsequent lines, looking for repeating patterns. Such is the case for the rest of the green highlighted lines. Even accounting for OCR errors, each line is similar enough to detect a pattern. We have 5 sets of very similar lines of text. We have ultimately 5 section instances returned for the Data Section.

Eventually, Grooper will detect a line that does not fit the pattern. The red highlighted line is totally dissimilar from the set of similar lines detected previously. Because it does not fit the periodic pattern, it marks a stopping point: this is where Grooper "knows" to stop. This text is left out of the last section instance and, with no further lines matching the detected periodic pattern, no further section instances are returned.
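The line-by-line comparison described above can be sketched roughly in Python. This is only an illustration of the idea, not Grooper's implementation: the `similarity` and `find_periodic_lines` helpers, the 0.8 threshold, and the sample lines are all assumptions made for this example, and the standard library's `difflib.SequenceMatcher` stands in for whatever similarity measure Grooper actually uses.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Replace every digit with '#' so differing numbers do not count as differences."""
    return re.sub(r"\d", "#", line)


def similarity(a, b):
    """Return a 0..1 similarity score between two normalized lines."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_periodic_lines(lines, reference, threshold=0.8):
    """Return the indexes of lines that closely resemble the reference line.

    The threshold is an arbitrary value chosen for this sketch.
    """
    return [
        index
        for index, line in enumerate(lines)
        if similarity(reference, line) >= threshold
    ]


# Hypothetical document text: three repeating sections, then an unrelated line.
sample_lines = [
    "A 8. Production Unit Number 9. Purchasers/Producers Report Number",
    "003-123456-0-000 1234",
    "B 8. Production Unit Number 9. Purchasers/Producers Report Number",
    "003-123457-0-000 5678",
    "C 8. Production Unit Number 9. Purchasers/Producers Report Number",
    "003-123458-0-000 9012",
    "TOTAL REMITTANCE DUE: $1,234.56",
]

periodic = find_periodic_lines(sample_lines, sample_lines[0])
print(periodic)
```

In this sketch, the lines that start each section score well above the threshold against the first line, while the value lines and the final unrelated line do not. The final line therefore falls outside the repeating pattern, much like the red highlighted line in the example.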

The Transaction Detection method is not going to work well for every use case. It succeeds best where most of the data in the section is numerical in nature.

It's easy to normalize numeric data. Any individual digit can be swapped for a "#" symbol. A currency value on a line of text in one section could be $988,000.00 and $112,433.00 in another, but as far as comparing the lines for periodic similarity (also referred to as "periodicity"), they can both be normalized as "$###,###.##". Lexical data tends to be trickier. How do you normalize a name, for example? How do you differentiate a name from a field label? You can do it with a variety of extraction techniques, but not using this line-by-line approach to determining how similar one line is to another.
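A minimal Python sketch shows why numeric data normalizes so cleanly while lexical data does not. The `normalize_numbers` helper and the sample lines (including the names) are hypothetical and only meant to illustrate the point. Grooper's own normalization may differ.

```python
import re


def normalize_numbers(text):
    """Replace every digit with '#', leaving letters and punctuation untouched."""
    return re.sub(r"\d", "#", text)


# Two currency lines from different sections collapse to the same shape.
line_a = normalize_numbers("Gross Value $988,000.00")
line_b = normalize_numbers("Gross Value $112,433.00")
print(line_a)
print(line_a == line_b)

# Lexical values (hypothetical names) are left as-is, so the lines still differ.
name_a = normalize_numbers("Operator: Jane Smith")
name_b = normalize_numbers("Operator: Robert Jones")
print(name_a == name_b)
```

After normalization the two currency lines are identical, so they register as part of the same periodic pattern. The two name lines remain different because there is no equally simple rule for collapsing a name into a common shape.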

This is precisely why it's called "Transaction" Detection. It works best with transactional data, which tends to be currency, quantity or otherwise numerical values. Indeed, this method was specifically designed for EOB (Explanation of Benefits) form processing and medical provider payment automation in general.

FYI

What does this have to do with Labeling Behavior and Label Sets?

We're getting there. Ultimately, Transaction Detection is "Label Set aware" and can take advantage of collected Header and Footer labels for a Data Section object. However, collecting labels for the Data Section will quite dramatically change how Transaction Detection works.

It is best to understand how this sectioning method works without Label Sets before we delve into how it works with them.

Configuring Transaction Detection with Binding Fields

Without utilizing Label Sets, the Transaction Detection sectioning method must assign at least one Binding Field in order to detect the periodic similarity among lines of text in a document, ultimately forming the Data Section's section instances.

  1. For this example, we will end up configuring the "Production Info" Data Section of this Data Model.
  2. We will utilize the "Production Unit Number" as the Binding Field.
  3. This Data Field utilizes a simple Pattern Match extractor for its Value Extractor assignment.

  1. The Pattern Match returns the production unit numbers on the document using a simple pattern \d{3}-\d{6}-\d-\d{3}.
  2. Importantly, notice this returns five result candidates (when testing extraction at the Data Field level in the Node Tree).
    • Due to the limited space on the screen, only four results are shown here, but if you scroll down further in the Batch Viewer, you will see the fifth result being returned.
    • This will be important because we want to end up creating five section instances. If you expect to return five section instances, your Binding Field's extractor (or Binding Fields extractors if using more than one) will need to return five results.
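To see what the pattern matches outside of Grooper, here is a small Python check. The pattern is the one used in this article. The sample text and its production unit numbers are made up for illustration, and Python's `re` module is only standing in for Grooper's Pattern Match extractor.

```python
import re

# The pattern from the "Production Unit Number" Data Field's Value Extractor.
PRODUCTION_UNIT_NUMBER = re.compile(r"\d{3}-\d{6}-\d-\d{3}")

# Hypothetical document text with one production unit number per section.
sample_text = """
A 8. Production Unit Number 003-123456-0-000
B 8. Production Unit Number 003-123457-0-000
C 8. Production Unit Number 003-123458-0-000
D 8. Production Unit Number 003-123459-0-000
E 8. Production Unit Number 003-123460-0-000
"""

candidates = PRODUCTION_UNIT_NUMBER.findall(sample_text)
print(len(candidates))
print(candidates)
```

Five sections yield five results. If the count of results does not match the number of section instances you expect, the Binding Field's extractor is the first thing to revisit.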

Next, we will configure the "Production Info" Data Section to create section instances using the Transaction Detection method.

  1. Select the Data Section in the Node Tree.
  2. Using the Extract Method property, select Transaction Detection.
  3. Click the ellipsis at the end of the Binding Fields property.

  1. Choose which Data Fields in the Data Section should be used as Binding Fields by checking the box next to the Data Field.
    • Here, we have selected the "Production Unit Number" Data Field.
  2. Click "OK".

For this example, all we need to do is assign this single Data Field as a Binding Field. There is enough similarity between the repeating sections that nothing more is required. For more complicated situations you may need multiple Binding Fields. Just be sure every Binding Field is present in each section: a Data Field that is "optional" (missing from some sections) should not be used as a Binding Field.

The Transaction Detection method will then go through the line-by-line comparison process around the Binding Fields to detect periodic similarities to produce section instances.

  1. How Grooper goes about detecting these periodic patterns is controlled by the Periodicity Detection set of properties.

  1. In our case, five section instances were established, one for each result from the "Production Unit Number" Data Field's Value Extractor.

  1. If you need to troubleshoot the Transaction Detection method's results, the "Diagnostics" can give you additional information as to how Grooper detected these repeating patterns in the document's text data. Click the "Diagnostics" button at the top of the tab.

  1. You will find several reports for the Data Section such as the "Execution Log".

Configuring Transaction Detection with Label Sets

Now that we understand the basics of the Transaction Detection method, we can look at how this sectioning method interacts with the Labeling Behavior. Its behavior is wildly different if a Header label is collected for the Data Section. Assuming you can collect a Header label for the Data Section, it is so different that a Binding Field is not even necessary to produce the section instances.

Establishing the section instances is almost as simple as...

  1. Start the section instance at the Header label.
  2. Stop the section instance at the next Header label (or Footer label).
  3. Repeat for every Header label found on the document.
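The three steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Grooper's actual logic: the `split_by_header` helper is hypothetical, labels are matched with a plain substring test, and the footer text and sample lines are invented for the example. The header text is the first part of the Header label collected later in this article.

```python
def split_by_header(lines, header, footer=None):
    """Start a section at each header line; close it at the next header or footer."""
    sections = []
    current = None
    for line in lines:
        if footer is not None and footer in line:
            # A footer closes the open section, if there is one.
            if current is not None:
                sections.append(current)
                current = None
            continue
        if header in line:
            # A new header closes the previous section and opens the next.
            if current is not None:
                sections.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        sections.append(current)
    return sections


# Hypothetical document lines with two sections and a footer.
sample_lines = [
    "Form 300 Monthly Report",
    "8. Production Unit Number 9. Purchasers/Producers Report Number",
    "003-123456-0-000 1234",
    "8. Production Unit Number 9. Purchasers/Producers Report Number",
    "003-123457-0-000 5678",
    "Total Remittance",
    "Signature",
]

sections = split_by_header(
    sample_lines,
    header="8. Production Unit Number",
    footer="Total Remittance",
)
print(len(sections))
```

Text before the first header and after the footer belongs to no section, which mirrors how each section instance begins on a Header label line and ends before the next one.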

For example, we have collected a Header label for the "Production Info" Data Section here.

  1. To add the label, we've selected the Content Model in the Node Tree.
  2. We've navigated to the "Labels" tab.
  3. We've selected the document in the Batch classified as the "OTC Form 300" Document Type.
    • In other words, this is the Label Set for the "OTC Form 300" Document Type.
  4. We've selected the Data Section in the Data Model. For the Header label, we've captured the first line of field labels.
    • 8. Production Unit Number 9. Purchasers/Producers Report Number 10. Gross Volume 11. Gross Value
  5. Notice we have five hits for this label, one at the start of each section.

Next, we will configure the "Production Info" Data Section to create section instances using the Transaction Detection method.

  1. Select the Data Section in the Node Tree.
  2. Using the Extract Method property, select Transaction Detection.
  3. Notice no Binding Fields are selected.

  1. Let's go back to the Data Section.
  2. We have set the Extract Method to Transaction Detection.
  3. Note that we have no Binding Fields set on this Data Section.

  1. Here we are getting all five sections being returned!
  2. The section instance starts on the line containing the Header label.
  3. And it ends on the line before the next Header label.
    • Then the second section instance starts at the second header and so on.


Glossary

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Data Field: Data Fields represent a single value targeted for data extraction on a document. Data Fields are created as child nodes of a Data Model and/or Data Sections.

  • Data Fields are frequently referred to simply as "fields".

Data Section: A Data Section is a container for Data Elements in a Data Model. They can contain Data Fields, Data Tables, and even Data Sections as child nodes and add hierarchy to a Data Model. They serve two main purposes:

  1. They can simply act as organizational buckets for Data Elements in larger Data Models.
  2. By configuring its "Extract Method", a Data Section can subdivide larger and more complex documents into smaller parts to assist in extraction.
    • "Single Instance" sections define a division (or "record") that appears only once on a document.
    • "Multi-Instance" sections define a collection of repeating divisions (or "records").

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Labeling Behavior: A Labeling Behavior extends "label set" functionality to Document Types. This allows you to collect field labels and other labels present on a document and use them in a variety of ways. This includes functionality for classification, field extraction, table extraction, and section extraction.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, and profile objects are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Section Extract Method: The Extract Method property of a Data Section defines a "Section Extract Method" which specifies how section instances will be identified and extracted.

Transaction Detection: Transaction Detection is a Data Section Extract Method. This extraction method produces section instances by detecting repeating patterns of text around the Data Section's child Data Fields.