Paragraph Marking (2.9)

From Grooper Wiki

Paragraph Marking alters the normal text data in a document by placing the carriage return and line feed pairs (\r\n) at the end of each paragraph instead of at the end of each line. This allows users to break up a document's text flow into paragraphs instead of lines.

The Paragraph Marking property is enabled in the Preprocessing Options of the Pattern Editor window. There are several paragraph detection settings that determine what qualifies as a paragraph.


About

Paragraph Marking is part of Grooper's Natural Language Processing (NLP) solution. Normally, after text data is obtained from the Recognize activity, the carriage return and line feed characters (\r\n) are inserted at the end of each line. For structured documents, such as forms and reports, this is extremely useful because information is typically conveyed line by line. These characters can be helpful anchors to locate and parse data.

For structured documents, like this report, important information is laid out line by line.
Paragraph 1.png

However, unstructured documents, such as contracts and correspondence, convey information differently. Instead of information being broken up line by line, it is broken up paragraph by paragraph.

For unstructured documents, like this contract, the same information is there. However, it is embedded in paragraphs.
Paragraph 2.png


Paragraph Marking allows Grooper to alter the \r\n pairs within the text data. Instead of being placed at the end of each line, they are placed at the end of each paragraph.

Instead of \r\n pairs marking the end of each line... ...they mark the end of each paragraph.
Paragraph 3.png
Paragraph 4.png

Enabling the Paragraph Marking property first detects paragraphs in a document, then removes all \r\n pairs except the pair at the last line of the paragraph. Furthermore, the removed \r\n pairs are replaced with a single space character to keep the normal text flow intact. That way the word at the end of a line will have a space between it and the word at the beginning of the next line.
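The transformation described above can be sketched in a few lines of Python. This is only an illustration of the resulting text, not Grooper's implementation: the function name `mark_paragraphs` and the sample sentences are made up for this example, and the lines are assumed to be grouped into paragraphs already.

```python
def mark_paragraphs(paragraphs):
    # `paragraphs` is a list of paragraphs, each one a list of line strings.
    # Lines inside a paragraph are joined with a single space (replacing the
    # removed CR/LF pairs); one CR/LF pair is kept at the end of the paragraph.
    return "".join(" ".join(lines) + "\r\n" for lines in paragraphs)


# Hypothetical sample: two paragraphs of two lines each.
grouped_lines = [
    ["This Agreement is made between", "the parties named below."],
    ["It is governed by the laws of", "the State of Oklahoma."],
]

marked = mark_paragraphs(grouped_lines)
print(repr(marked))
```

Each paragraph ends up as one long line, with a space where each removed line break used to be.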


This allows you to do two very important things when extracting data from unstructured documents.

1. You can easily target data that spans a line break.
Imagine you want to extract the highlighted address in this paragraph. It starts on one line and ends on another.
Paragraph 5.png
Under normal circumstances, there is a \r\n pair between the "18" and the "18th" of "18 18th Street".
Paragraph 6.png
If you're extracting data from a paragraph, there's a good chance it could be broken up between lines like this. Furthermore, there's no telling at what point in the string it jumps to the next line. You could potentially plan for this in your extractor's regular expression pattern, including the option for every space character to also be a \r\n pair. However, your pattern will get bulky and needlessly complicated.
Instead, Paragraph Marking formats the text in a way that accounts for this. By removing the \r\n pairs at the end of each line, it turns the whole paragraph into one big line of text.

This allows a simpler data extractor to return the address. The \r\n pairs that would normally interfere with a match are no longer there.

Paragraph 7.png
2. You can section out a document into paragraphs.
Imagine a series of contracts. All of them contain some information you want to extract from a certain clause.


Paragraph 8.png


If you can locate that clause's paragraph, you can limit extraction to match just the text within that paragraph, rather than the full document.


Paragraph 9.png


Paragraph 10.png
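To make the first point concrete, here is a small Python sketch comparing the two approaches for the "18 18th Street" address. The surrounding sample text and both patterns are assumptions made for illustration. Python's `re` module is used here; Grooper's own regex engine may differ in details.

```python
import re

# Hypothetical sample text. Without Paragraph Marking the address wraps
# across two lines; with it, the line break has become a single space.
unmarked = "notices shall be sent to 18\r\n18th Street, Oklahoma City\r\n"
marked = "notices shall be sent to 18 18th Street, Oklahoma City\r\n"

# Simple pattern: assumes plain spaces between the parts of the address.
simple = re.compile(r"\d+ \d+th Street")

# Bulky pattern: every space is allowed to be a CR/LF pair instead.
bulky = re.compile(r"\d+(?: |\r\n)\d+th(?: |\r\n)Street")

print(simple.search(unmarked))              # no match on the unmarked text
print(repr(bulky.search(unmarked).group())) # matches, line break included
print(simple.search(marked).group())        # simple pattern now matches
```

The bulky pattern only grows as the value gets longer, since every space needs the same alternative. With Paragraph Marking, the simple pattern is enough.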

What Is a Paragraph?

Paragraph Marking detects paragraphs within a document in many different ways. In order to understand how Grooper does this, ask yourself "How do you know what a paragraph is?"

You don't even necessarily need to have text to distinguish between one paragraph and another. Even without real text, you can probably figure out what separates each of these "paragraphs" for these "documents".

Paragraph 12.png

The space between paragraphs is larger than the space between lines.

Paragraph 11.png

Paragraphs are often indented at the beginning.

Paragraph 13.png

Sometimes, bullets are used as paragraph markers.


Paragraph Marking finds paragraphs similarly. It uses the change in spacing between lines and other markers such as indentation and text bullets to determine where a new paragraph should start.

See the How To section of this article for more in-depth information on how to configure Paragraph Marking to detect paragraphs.

Use Cases

Paragraph Marking is part of Grooper's Natural Language Processing (NLP) solution. It aids in data extraction from unstructured documents.

Imagine you have a collection of Non-Disclosure Agreements. In these contracts, there is a certain clause you're looking for with certain information. For example, a "Governing Law" clause that states which state has jurisdiction should there be a dispute. You'll need to find the paragraph (or paragraphs) that make up the clause and extract which state has jurisdiction from that clause. You'll use Paragraph Marking in a few ways to do this:

  1. To break up the contract into paragraphs instead of lines
  2. To serve as the Value Extractor on a Field Class that classifies the matching paragraph as a Governing Law clause.
  3. To use the processed text data (with \r\n pairs removed) to find the state holding jurisdiction.
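The idea behind these three steps can be sketched outside Grooper in plain Python. This is not how a Field Class is configured; it only shows why paragraph-marked text makes the task simpler. The sample contract text, the "Governing Law" keyword test, and the `STATE` pattern are all assumptions for this example.

```python
import re

# Hypothetical paragraph-marked text: one CR/LF pair per paragraph.
marked_text = (
    "1. Confidentiality. The Disclosing Party, a corporation organized in "
    "the State of Delaware, may share Confidential Information.\r\n"
    "2. Governing Law. This Agreement shall be governed by the laws of "
    "the State of New York.\r\n"
)

# Step 1: break the text into paragraphs.
paragraphs = re.findall(r"[^\r\n]+", marked_text)

# Step 2: pick the paragraph that looks like the Governing Law clause.
clause = next(p for p in paragraphs if "Governing Law" in p)

# Step 3: extract the state from that paragraph only.
STATE = re.compile(r"State of ([A-Z][a-z]+(?: [A-Z][a-z]+)?)")
state = STATE.search(clause).group(1)

print(state)
# Searching the whole document instead would hit the wrong state first.
print(STATE.search(marked_text).group(1))
```

Limiting the search to the clause's paragraph avoids the false hit in the first paragraph.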

This is one of many examples. Any time you need to section a document into paragraphs and extract data from those paragraphs, you will likely take advantage of Paragraph Marking.

How To

Configuring Paragraph Marking is all about configuring how it detects paragraphs, so much so that Paragraph Marking is often referred to as paragraph detection. As mentioned above, this is done similarly to how you, as a human reader, break up paragraphs: mostly using the space between paragraphs, indentations, or other cues to mark a new paragraph. These configuration settings are detailed below.

If you want to follow along, using the documents in this example, download the zip file linked below and import it into your Grooper environment.

Paragraph Marking is a Preprocessing Option for any regex pattern extractor in Grooper, including the Pattern property of Data Types, Data Formats, and any Internal Extractors.

For this example, a Content Model named "Paragraph Marking" was added to the Node Tree with its Local Resources Folder and Data Model.

1. Add a Data Type to the Local Resources Folder and name it "Paragraphs".

2. Add a Data Format to the "Paragraphs" extractor and name it "Detection Settings".

3. For the value pattern, enter [^\r\n]+. This will match every character until a \r\n pair is found. For now, this captures every single line of text. We will use this pattern to visualize paragraph marking, to see if Grooper successfully removes all \r\n pairs except for the end of each paragraph. Notice without Paragraph Marking enabled, this pattern returns each line of text.


Paragraph 14.png


4. To enable Paragraph Marking, switch to the Properties tab.

5. Expand the Preprocessing Options property.

6. Select the Paragraph Marking property and change it from Disabled to Enabled. (You may do this by either selecting Enabled from the dropdown list or double-clicking the Paragraph Marking property.)


Paragraph 15.png


Paragraph 16.png


Now that we know how to turn Paragraph Marking on, the following sections look at how it detects paragraphs through its configurable properties.

7. Expand the Paragraph Marking property to see all its configurable properties.
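If you want to see what the value pattern from step 3 does outside of Grooper, the short Python sketch below runs `[^\r\n]+` against two versions of the same text. The sample strings are made up for this example, and Python's `re` module stands in for Grooper's regex engine.

```python
import re

VALUE_PATTERN = r"[^\r\n]+"

# Hypothetical text without Paragraph Marking: a CR/LF pair after every line.
without_marking = (
    "First line of paragraph one\r\n"
    "second line of paragraph one.\r\n"
    "Only line of paragraph two.\r\n"
)

# The same text with Paragraph Marking: a CR/LF pair after every paragraph.
with_marking = (
    "First line of paragraph one second line of paragraph one.\r\n"
    "Only line of paragraph two.\r\n"
)

lines = re.findall(VALUE_PATTERN, without_marking)
paragraphs = re.findall(VALUE_PATTERN, with_marking)

print(len(lines), "results without Paragraph Marking")
print(len(paragraphs), "results with Paragraph Marking")
```

The pattern itself never changes. Only the position of the \r\n pairs in the text decides whether it returns lines or paragraphs.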


Paragraph 17.png

The first thing Paragraph Marking does behind the scenes is look for lines of text, one after another, that have the same (or nearly the same) line spacing between them. It is generally the case that a single paragraph will have the same amount of space between its lines of text.

Paragraph 20.png

From there, Paragraph Marking's properties alter how these lines are grouped together and when a new paragraph should start.

The Maximum Line Spacing property sets the maximum height of line spacing inside a paragraph, in inches. Note that this includes the height of the white space below a line as well as the height of the text line itself.

In other words, Line Spacing = Text Height + Gap Height


Paragraph 18.png
Paragraph 19.png


So, if a line of text measures 0.2 inches and the gap below it measures 0.1 inches, its line spacing height is 0.3 inches. Now, if the gap between the final line of one paragraph and the starting line of the next is 0.2 inches, that line spacing height is 0.4 inches. Setting the Maximum Line Spacing property higher than 0.4 inches would allow both paragraphs to be counted as one. Setting it below 0.4 inches tells paragraph detection that the gap is too big, and Paragraph Marking will start a new paragraph.
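The arithmetic above can be written out as a short Python sketch. The function names are hypothetical, and the strict "greater than" comparison at exactly the limit is an assumption; the article does not say how Grooper treats a spacing equal to the setting. `Decimal` is used so that 0.2 + 0.1 compares cleanly with 0.3.

```python
from decimal import Decimal


def line_spacing(text_height, gap_height):
    # Line Spacing = Text Height + Gap Height (all values in inches).
    return text_height + gap_height


def starts_new_paragraph(spacing, maximum_line_spacing):
    # Assumption: a new paragraph starts only when the spacing is strictly
    # larger than the Maximum Line Spacing setting.
    return spacing > maximum_line_spacing


within = line_spacing(Decimal("0.2"), Decimal("0.1"))   # inside a paragraph
between = line_spacing(Decimal("0.2"), Decimal("0.2"))  # between paragraphs

for setting in (Decimal("0.4"), Decimal("0.3")):
    print(setting, starts_new_paragraph(within, setting),
          starts_new_paragraph(between, setting))
```

With the setting at 0.4 neither spacing triggers a break, so everything stays in one paragraph. With the setting at 0.3 only the 0.4 inch spacing between paragraphs does.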

You can see this in our example in Grooper. All properties besides Maximum Line Spacing have been set to 0 or False to illustrate only how line spacing affects paragraph detection. The default value of 0.4 is too large. All paragraphs are detected as one big paragraph.


Paragraph 21.png


Changing that property to 0.3 will start a new paragraph if the line spacing height is larger than 0.3 inches. This will break up the text into its component paragraphs.


Paragraph 22.png

With a large document set, however, there's no guarantee the maximum line spacing will always be the same literal height. Even within a single document, the line spacing between lines in a paragraph may change.

For example, this document switches between single and double spaced paragraphs. You can see the last paragraph is broken up into two paragraphs, when it should be one.


Paragraph 23.png


This is resolved by the Paragraph Spacing Ratio property. Instead of using a specific value for the height, it looks at the percentage by which the line spacing increases after a final line. Using the default of 125% marks a new paragraph when the line spacing height exceeds 125% of the normal height, that is, when it grows by more than 25%.

Setting this property back to 125% will keep this paragraph as a single paragraph instead of two. First, the line spacing between the first two lines is measured. Then the line spacing between the third and fourth lines is found to be less than 125% of that height. So, a new paragraph is not created.
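One way to picture the ratio test is the Python sketch below. It is a simplified reading of the description in this article, not Grooper's algorithm: the function name is hypothetical, and taking the baseline from the first pair of lines in each paragraph is an assumption.

```python
def paragraph_starts(spacings, ratio=1.25):
    # spacings[i] is the line spacing between line i and line i + 1, in inches.
    # Returns the indexes of the lines that start a paragraph.
    starts = [0]
    baseline = None
    for i, spacing in enumerate(spacings):
        if baseline is None:
            # Assumption: the first pair of lines sets the "normal" spacing.
            baseline = spacing
        elif spacing > baseline * ratio:
            starts.append(i + 1)
            baseline = None  # measure a fresh baseline for the new paragraph
    return starts


# Single spaced lines with one larger gap after the third line.
print(paragraph_starts([0.30, 0.30, 0.45, 0.30, 0.30]))

# A double spaced paragraph: large spacing, but consistent, so no break.
print(paragraph_starts([0.60, 0.62, 0.60]))
```

Because the test is relative, the double spaced paragraph stays whole even though every spacing in it is larger than a fixed Maximum Line Spacing of 0.3 would allow.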


Paragraph 24.png


However, do note this can cause some issues. Grooper looks at the space between two lines to determine the baseline for the line spacing height. Then, it looks at the spacing between the second and third lines to determine if it's within the set ratio.

First, this setting only impacts paragraphs that contain at least two lines of text. If a single line is detected as a paragraph, it won't have the context of the second line to figure out what the line spacing should be.

Second, if a paragraph with smaller spacing follows a paragraph with larger spacing, the paragraph with smaller spacing can be consumed by the first. Because the baseline line spacing was measured from the paragraph with the larger spacing, the gap separating the two paragraphs may be no larger than the first paragraph's normal spacing. Furthermore, the spaces between the lines of the second paragraph are smaller than those of the first. So, a new paragraph won't be detected until Grooper finds a gap exceeding the Paragraph Spacing Ratio relative to that baseline.

You can see this in action below. The third paragraph gets combined with the second because the space between them is the same as the space between the lines in the second paragraph. A new paragraph isn't detected until the fourth paragraph, since the space between the last line of the third paragraph and the fourth paragraph does exceed the set ratio.


Paragraph 25.png


In an ideal world, there would be a little extra space between the second and third paragraph. However, the world of document processing is not always ideal. There is a solution for this particular case, demonstrated by the next property, Minimum Line Width.

One common feature of paragraphs is that the last line tends to be shorter than the rest.

Paragraph 26-2.png


This is not always the case, but it can serve as another breaking point for where to stop and start a paragraph.

In our poorly spaced example, Paragraph Spacing Ratio caused the second and third paragraph to be detected as a single paragraph. However, the last line of the second paragraph is only about 1.75 inches long.

If we reset the Minimum Line Width to the previous default of 3, Paragraph Marking will start a new paragraph if the last line of a paragraph is less than 3 inches. This will give us the result we want.
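The effect of Minimum Line Width can be sketched as follows. This is an illustration under assumptions, not Grooper's code: the function name and the line widths are made up, and the rule is reduced to "a line narrower than the setting ends the current paragraph".

```python
def split_on_short_lines(line_widths, minimum_line_width=3.0):
    # line_widths: width of each line in inches, top to bottom.
    # Returns paragraphs as lists of line indexes.
    paragraphs, current = [], []
    for index, width in enumerate(line_widths):
        current.append(index)
        if width < minimum_line_width:
            # A short line is treated as the last line of its paragraph.
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


# Hypothetical page: two paragraphs ending in lines 1.75 and 2.1 inches wide.
widths = [6.4, 5.8, 6.3, 1.75, 6.2, 5.9, 2.1]

print(split_on_short_lines(widths, 3.0))  # breaks only after the short lines
print(split_on_short_lines(widths, 6.0))  # set too high: extra breaks appear
print(split_on_short_lines(widths, 0))    # effectively off: one paragraph
```

The second call shows the risk of a high setting: ordinary lines that happen to fall just under 6 inches also end a paragraph.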


Paragraph 27.png


However, the last line of a paragraph is not always the shortest line of the bunch. Setting this property too high can cause paragraph detection to break up single paragraphs into multiple lines or paragraphs.

You can see the results of setting the Minimum Line Width to 6 below.


Paragraph 28.png

The Line Wrap Threshold property breaks up lines into new paragraphs based on the width of the longest line on the page.

This is a variation of the Minimum Line Width property. It also uses the assumption that the narrowest line in a paragraph should be its last line. In general, while there may be some variation in the widths of individual lines in a paragraph, the last line will be much shorter. Instead of using a hard number, it uses a minimum percentage of the longest line on the document.

Using the default value of 60%, if a line is less than 60% of the width of the longest line on the document, a new paragraph will begin. As seen in the example below, this creates two breaks in the middle paragraph.


Paragraph 38.png


If you increase this threshold, you increase the size of a line that marks a new paragraph. Increasing this property from 60% to 80%, in this example, separates each line in the narrower paragraph as a new "paragraph".


Paragraph 39.png


If you decrease this threshold, you decrease the size of a line that marks a new paragraph. Decreasing this property from 60% to 50%, in this example, keeps the narrower paragraph as a single paragraph. Only the last line is below the Line Wrap Threshold, which is where the new paragraph is detected and marked.
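The three settings discussed above can be compared with a small Python sketch. As before, this is a simplified model rather than Grooper's implementation: the function name and the line widths are invented, and the longest line is taken from the list passed in.

```python
def split_on_wrap_threshold(line_widths, threshold=0.60):
    # A line narrower than threshold * (longest line) ends its paragraph.
    # Returns paragraphs as lists of line indexes.
    limit = max(line_widths) * threshold
    paragraphs, current = [], []
    for index, width in enumerate(line_widths):
        current.append(index)
        if width < limit:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


# Hypothetical page: a wide paragraph, a narrower middle paragraph
# (lines 3 to 7), and another wide paragraph. Widths are in inches.
widths = [6.5, 6.4, 3.0, 3.5, 4.2, 3.7, 4.1, 2.0, 6.5, 6.3, 2.5]

for threshold in (0.50, 0.60, 0.80):
    result = split_on_wrap_threshold(widths, threshold)
    print(threshold, len(result), result)
```

At 50% only the true last lines fall under the limit. At 60% the middle paragraph picks up two extra breaks, and at 80% every line of it becomes its own "paragraph".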


Paragraph 40.png


Aside from the space between paragraphs, indenting the first line is the biggest visual indication of a new paragraph. If your documents have indented paragraphs, you can tell where a new paragraph starts even without extra space between them.

Paragraph Marking uses the Indent Size property to detect where a new paragraph starts. As long as the indentation is larger than this value, in inches, a new paragraph is detected.

This is measured from the beginning of the line of text before the indented line to the beginning of the indented line.


Paragraph 29.png


As you can see below, even without extra space between the paragraphs, the Indent Size property starts a new paragraph every time an indentation larger than the set value is found (here, the default of 0.2).


Paragraph 30.png


Furthermore, this measurement goes both ways. It will look for paragraphs indented out as well as in from the previous line.


Paragraph 31.png

With any longer or more complicated unstructured document, you'll often see numbered or lettered bullets to break up paragraphs. These can be very helpful indicators where a paragraph starts. The Detect Bullet property will start a new paragraph whenever it sees text bullets like the following.

Numbered Bullets    Lowercase Bullets    Uppercase Bullets    Roman Numeral Bullets
1. Line of text     a. Line of text      A. Line of text      i. Line of text
1) Line of text     a) Line of text      A) Line of text      i) Line of text
(1) Line of text    (a) Line of text     (A) Line of text     (i) Line of text

Turning this property to True will mark a new paragraph where bulleted headings are found.
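For a rough idea of what such bullets look like to a regular expression, here is a Python sketch that recognizes the styles listed above. The pattern is an assumption written for this example; the pattern Grooper actually uses for Detect Bullet is not documented in this article.

```python
import re

# Hypothetical pattern covering "1.", "1)", "(1)" and the letter and
# lowercase Roman numeral equivalents at the start of a line.
BULLET = re.compile(
    r"""
    ^\s*
    (?:
        \(? (?: \d+ | [A-Za-z] | [ivxlcdm]+ ) \)   # styles closed by a parenthesis
      |     (?: \d+ | [A-Za-z] | [ivxlcdm]+ ) \.   # styles closed by a period
    )
    \s+
    """,
    re.VERBOSE,
)


def is_bullet_line(line):
    return BULLET.match(line) is not None


for sample in ["1. Line of text", "a) Line of text", "(A) Line of text",
               "iii. Line of text", "Line of text"]:
    print(repr(sample), is_bullet_line(sample))
```

A simple pattern like this can misfire on short words followed by a period, which is one reason a dedicated property is more convenient than writing your own.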


Paragraph 32.png


The First Line Extractor property allows you to "hard code" where a paragraph should start. Any text matching the extractor set here will start a new paragraph.

Interestingly, dotted bullets are not part of Paragraph Marking's bullet detection. This is because, typically, these characters are not picked up by OCR. If they are recognized, they are recognized as a period or another character.

Seen in this example, the text is marked as one huge paragraph.

However, since this document's text was extracted via the PDF text extract functionality of the Recognize activity, these were actually extracted as dotted bullet characters.


Paragraph 33.png


If we set the First Line Extractor to a simple Internal pattern, and enter the dotted bullet in the Value Pattern, it will match these bullets.


Paragraph 34.png


Now, Paragraph Marking detects a new paragraph every time the extractor returns a result, in this case every time that dotted bullet character is returned. We get the result we want: each paragraph is separated appropriately.
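The same idea can be sketched in Python. This is an illustration only: the function name and sample lines are made up, and anchoring the bullet pattern to the start of a line is an assumption of this sketch rather than documented Grooper behavior.

```python
import re

# Hypothetical "first line" pattern: a dotted bullet at the start of a line.
FIRST_LINE = re.compile(r"^\s*•")


def mark_by_first_line(lines):
    # Start a new paragraph at every line the pattern matches, then join each
    # paragraph's lines with spaces and end it with one CR/LF pair.
    paragraphs = []
    for line in lines:
        if FIRST_LINE.search(line) or not paragraphs:
            paragraphs.append([line])
        else:
            paragraphs[-1].append(line)
    return "".join(" ".join(p) + "\r\n" for p in paragraphs)


sample_lines = [
    "• First item that wraps",
    "onto a second line",
    "• Second item",
    "• Third item wraps",
    "too",
]

print(repr(mark_by_first_line(sample_lines)))
```

Five lines of text become three paragraphs, one per dotted bullet.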


Paragraph 35.png

The Maximum Horizontal Gap property controls how large a gap on a single line should be allowed before a new paragraph is detected.

This can cause some problems for unstructured document extraction. Often you will see "fill in the blank" style contracts or other forms where you enter in a name or other information. For our example the name "Scrooge" is typed onto an underlined word with lots of space left on each side. With the default Maximum Horizontal Gap of 0.5, a new paragraph is started every time there is 0.5 inches between two pieces of text. As seen in the example below, this makes a new paragraph out of each line with that gap.

(Note: The Indent Size property was disabled by setting it to 0 so that it does not interfere with our results.)


Paragraph 41.png


You can get around this issue by increasing the Maximum Horizontal Gap. Increasing it to something very large, such as 10, will allow a gap of 10 inches between characters on a single line. This will effectively disable the property, using the other properties in Paragraph Marking to detect and mark new paragraphs.
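A simplified model of this behavior is sketched below in Python. The function name, the word positions, and the tuple layout are assumptions made for this example; Grooper works from its own layout data.

```python
def split_line_on_gaps(words, maximum_horizontal_gap=0.5):
    # words: (text, left, right) tuples for one line, in inches, ordered
    # left to right. A gap wider than the setting starts a new segment.
    segments = [[words[0][0]]]
    for (_, _, prev_right), (text, left, _) in zip(words, words[1:]):
        if left - prev_right > maximum_horizontal_gap:
            segments.append([text])
        else:
            segments[-1].append(text)
    return [" ".join(segment) for segment in segments]


# Hypothetical fill-in-the-blank line: "Scrooge" sits on an underline with
# wide gaps on both sides.
line = [
    ("I,", 1.0, 1.2),
    ("Scrooge", 2.0, 2.7),
    ("agree", 3.6, 4.1),
    ("to", 4.2, 4.35),
]

print(split_line_on_gaps(line, 0.5))   # the wide gaps split the line
print(split_line_on_gaps(line, 10))    # a very large setting keeps it whole
```

With the default of 0.5 the name is cut off from the text on both sides. With a setting of 10 no realistic gap can trigger a break.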


Paragraph 42.png

The Consider Underline property alters the allowable gap between characters on a line if an underline is detected. First, this information must be present in the page's LayoutData.json file. A line must be detected from either a Line Detection or Line Removal IP Command during the Image Processing or Recognize activity in order for Grooper to know where that line is and how big it is.

With this information present, Consider Underline will treat the underline as if it were text, even if there is nothing typed on the line. A new paragraph will not be created even if the gap between characters exceeds the Maximum Horizontal Gap property's value.


Paragraph 43.png


Paragraph 44.png

Note: In order for Consider Underline to work, line data must be present in the page's layout data. If Grooper does not know that line is there, the space between characters will still break the paragraph up if that space exceeds the Maximum Horizontal Gap property's value. The line data must be obtained via a Line Detection or Line Removal command using either the Image Processing or Recognize activity prior to running Paragraph Marking.

The Detect Double Space property doesn't have anything to do with marking a new paragraph. Instead, it has to do with how \r\n pairs are inserted in the text data.

If this property detects a paragraph is double spaced, it will insert a second \r\n pair. So, the text data will look something like this.

single spaced paragraph<\r><\n>
single spaced paragraph<\r><\n>
double spaced paragraph<\r><\n>
<\r><\n>
single spaced paragraph<\r><\n>

As you see below, with Detect Double Space set to False, Paragraph Marking still detects the second paragraph (which is double spaced) using the other detection properties. However, the text data has only a single \r\n pair at the end of the paragraph.


Paragraph 36.png


Setting Detect Double Space to True will insert an additional \r\n pair after the double spaced paragraph.


Paragraph 37.png


This can be helpful if you need to track down specifically double spaced paragraphs in a document. You will be able to distinguish them from single spaced paragraphs, knowing you can match two \r\n pairs instead of one.
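As a quick illustration, the Python sketch below finds only the double spaced paragraph in the sample text shown earlier. The pattern is one possible way to do it, written for this example, and Python's `re` module stands in for Grooper's regex engine.

```python
import re

# The sample text from above, with Detect Double Space set to True.
text = (
    "single spaced paragraph\r\n"
    "single spaced paragraph\r\n"
    "double spaced paragraph\r\n"
    "\r\n"
    "single spaced paragraph\r\n"
)

# Match a paragraph only when it is followed by two CR/LF pairs.
DOUBLE_SPACED = re.compile(r"[^\r\n]+(?=\r\n\r\n)")

found = DOUBLE_SPACED.findall(text)
print(found)
```

The lookahead keeps the \r\n pairs out of the returned value while still requiring both of them to be present.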


Version Differences

Prior to Grooper 2.9, the Line Wrap Threshold and Consider Underline properties did not exist.

For more information on these properties, visit the How To section of this article.

For the 2.8 version of this article, please follow this link.