2.80:Paragraph Marking (Property)

Paragraph Marking alters the normal text data in a document by placing the carriage return and line feed pairs (\r\n) at the end of each paragraph instead of at the end of each line. This lets users break up a document's text flow into paragraphs instead of lines.

The Paragraph Marking property is enabled in the Preprocessing Options of the Pattern Editor. There are several paragraph detection settings that determine what qualifies as a paragraph.

About

Paragraph Marking is part of Grooper's Natural Language Processing (NLP) solution. Normally, after text data is obtained from the Recognize activity, the carriage return and line feed characters \r\n are inserted at the end of each line. For structured documents, such as forms and reports, this is extremely useful because information is typically conveyed line by line. These characters can be helpful anchors to locate and parse data.

For structured documents, like this report, important information is laid out line by line.

However, unstructured documents, such as contracts and correspondence, convey information differently. Instead of information being broken up line by line, it is broken up paragraph by paragraph.

For unstructured documents, like this contract, the same information is there. However, it is embedded in paragraphs.

Paragraph Marking allows Grooper to alter the \r\n pairs within the text data. Instead of being placed at the end of each line, they are placed at the end of each paragraph.

Instead of \r\n pairs marking the end of each line...	...they mark the end of each paragraph.

Enabling the Paragraph Marking property first detects paragraphs in a document, then removes all \r\n pairs except the pair at the last line of the paragraph. Furthermore, the removed \r\n pairs are replaced with a single space character to keep the normal text flow intact. That way the word at the end of a line will have a space between it and the word at the beginning of the next line.
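The rewriting step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Grooper's implementation: the function name mark_paragraphs and the paragraph_ends input are hypothetical, and the sketch assumes paragraph detection has already decided which lines end a paragraph.

```python
def mark_paragraphs(lines, paragraph_ends):
    # lines: line strings without their trailing line breaks.
    # paragraph_ends: indexes of lines that end a paragraph
    # (hypothetical input standing in for paragraph detection).
    parts = []
    last = len(lines) - 1
    for i, line in enumerate(lines):
        parts.append(line)
        # Keep the \r\n pair only at a paragraph's last line;
        # every other line break becomes a single space.
        is_end = i in paragraph_ends or i == last
        parts.append("\r\n" if is_end else " ")
    return "".join(parts)


# Hypothetical sample: two paragraphs of two lines each.
lines = [
    "This Agreement is governed by",
    "the laws of the State of Oklahoma.",
    "Notices must be sent in writing",
    "to the address listed above.",
]
marked = mark_paragraphs(lines, paragraph_ends={1, 3})
print(repr(marked))
```

Each paragraph comes back as one long line, with a space where each interior line break used to be.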

This allows you to do two very important things when extracting data from unstructured documents.

1. You can easily target data that spans a line break.

Imagine you want to extract the highlighted address in this paragraph. It starts on one line and ends on another.

Under normal circumstances, there is a \r\n pair between the "18" and "18th" of "18 18th Street".

If you're extracting data from a paragraph, there's a good chance it could be broken up between lines like this. Furthermore, there's no telling at what point in the string it jumps to the next line. You could plan for this in your extractor's regular expression pattern by allowing every space character to be a \r\n pair instead. However, your pattern will get bulky and needlessly complicated.

Instead, Paragraph Marking formats the text in a way that accounts for this. By removing the \r\n pairs at the end of each line, it turns the whole paragraph into one big line of text.

This allows a simpler data extractor to return the address. The \r\n pairs that would normally interfere with a match are no longer there.
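The difference can be seen in a small sketch. The sample text and both patterns below are hypothetical, and the sketch uses Python's re module rather than Grooper's own pattern editor, so treat it as an illustration of the idea only.

```python
import re

# Hypothetical sample text: the address breaks across two lines after "18".
raw = "The property is located at 18\r\n18th Street and is leased\r\nto the tenant.\r\n"
# The same paragraph after Paragraph Marking: interior \r\n pairs become spaces.
marked = "The property is located at 18 18th Street and is leased to the tenant.\r\n"

# Simple pattern: literal spaces between the parts of the address.
simple = re.compile(r"\d+ \d+(?:st|nd|rd|th) Street")
# Break-tolerant pattern: every space may instead be a \r\n pair.
tolerant = re.compile(r"\d+(?: |\r\n)\d+(?:st|nd|rd|th)(?: |\r\n)Street")

print(simple.search(raw))                    # None: the \r\n blocks the match
print(repr(tolerant.search(raw).group()))    # matches, but the value contains \r\n
print(repr(simple.search(marked).group()))   # '18 18th Street'
```

The break-tolerant pattern works on the raw text, but it is longer and the returned value still contains the line break. With the marked text, the simple pattern is enough.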

2. You can section out a document into paragraphs.

Imagine a series of contracts. All of them contain some information you want to extract from a certain clause.

If you can locate that clause's paragraph, you can limit extraction to match just the text within that paragraph, rather than the full document.

What Is a Paragraph?

Paragraph Marking detects paragraphs within a document in many different ways. In order to understand how Grooper does this, ask yourself "How do you know what a paragraph is?"

You don't even necessarily need to have text to distinguish between one paragraph and another. Even without real text, you can probably figure out what separates each of these "paragraphs" for these "documents".


How are these paragraphs separated?

The space between paragraphs is larger than the space between lines.
Paragraphs are often indented at the beginning.
Sometimes, bullets are used as paragraph markers.
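The three visual cues (wider spacing, indentation, and bullets) can be expressed as a rough heuristic. The sketch below is not Grooper's detection algorithm. The Line type, the paragraph_starts function, and the spacing_ratio, indent, and bullets parameters are all hypothetical names chosen for illustration, and the coordinates are in arbitrary units.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    text: str
    top: float   # vertical position of the line
    left: float  # horizontal position of the first character


def paragraph_starts(lines, spacing_ratio=1.5, indent=12, bullets=("•", "-", "*")):
    """Return indexes of lines that appear to begin a new paragraph."""
    if not lines:
        return []
    gaps = [b.top - a.top for a, b in zip(lines, lines[1:])]
    typical_gap = median(gaps) if gaps else 0
    margin = min(line.left for line in lines)

    starts = [0]  # the first line always begins a paragraph
    for i in range(1, len(lines)):
        line = lines[i]
        wide_gap = gaps[i - 1] > typical_gap * spacing_ratio  # extra vertical space
        indented = line.left - margin >= indent               # first-line indent
        bulleted = line.text.lstrip().startswith(bullets)     # bullet marker
        if wide_gap or indented or bulleted:
            starts.append(i)
    return starts


# Hypothetical page: one paragraph break of each kind.
page = [
    Line("First paragraph, line one", top=100, left=72),
    Line("first paragraph, line two", top=112, left=72),
    Line("Second paragraph after a larger gap", top=136, left=72),
    Line("second paragraph, line two", top=148, left=72),
    Line("Third paragraph, indented", top=160, left=90),
    Line("third paragraph, line two", top=172, left=72),
    Line("• Fourth paragraph, bulleted", top=184, left=72),
]
print(paragraph_starts(page))  # [0, 2, 4, 6]
```

Any one cue is enough to start a new paragraph in this sketch. A real detector would need tuning for each document set, which is what the paragraph detection settings are for.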

Use Cases

Paragraph Marking is part of Grooper's Natural Language Processing (NLP) solution. It aids in data extraction from unstructured documents.

Imagine you have a collection of Non-Disclosure Agreements. In these contracts, there is a certain clause you're looking for with certain information. For example, a "Governing Law" clause that states which state has jurisdiction should there be a dispute. You'll need to find the paragraph (or paragraphs) that make up the clause and extract which state has jurisdiction from that clause. You'll use Paragraph Marking in a few ways to do this:

To break up the contract into paragraphs instead of lines.
To use as a Value Extractor on a Field Class, classifying the paragraph appropriately as a Governing Law clause.
To use the processed text data (with \r\n pairs removed) to find the state holding jurisdiction.

This is one of many examples. Any time you need to section a document into paragraphs and extract data from those paragraphs, you will likely take advantage of Paragraph Marking.
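The Governing Law workflow can be approximated outside Grooper with a short Python sketch. Everything here is an assumption made for illustration: the contract text is invented, a plain keyword search stands in for the Field Class classification, and the "State of ..." pattern is only one possible way to find the state.

```python
import re

# Hypothetical contract text after Paragraph Marking: one \r\n per paragraph.
marked = (
    "1. Confidentiality. The Recipient shall not disclose Confidential "
    "Information to any third party.\r\n"
    "2. Governing Law. This Agreement shall be governed by the laws of "
    "the State of Oklahoma.\r\n"
    "3. Term. This Agreement remains in effect for two years.\r\n"
)

# Step 1: break the contract into paragraphs instead of lines.
paragraphs = re.findall(r"[^\r\n]+", marked)

# Step 2: keep only paragraphs that look like a Governing Law clause
# (a keyword test standing in for real classification).
clauses = [p for p in paragraphs if re.search(r"governing law", p, re.IGNORECASE)]

# Step 3: search for the state inside those paragraphs only.
state_pattern = re.compile(r"State of ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")
states = [m.group(1) for p in clauses for m in state_pattern.finditer(p)]
print(states)  # ['Oklahoma']
```

Because the search in step 3 runs only on the matching clause, a "State of ..." phrase elsewhere in the contract would not be returned.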

Version Differences

Look forward to improvements in Paragraph Marking in version 2.9!

How To

Configuring Paragraph Marking is all about configuring how it detects paragraphs. So much so that Paragraph Marking is often referred to as paragraph detection. As mentioned above, this is done similarly to how you, as a human reader, break up paragraphs: mostly by using the space between paragraphs, indentations, or other cues to mark a new paragraph. These configuration settings are detailed below.

Enable Paragraph Marking
Maximum Line Spacing
Paragraph Spacing Ratio
Indent Size
Detect Bullets
Minimum Line Width
First Line Extractor
Detect Double Space

Paragraph Marking is a Preprocessing Option for any regex pattern extractor in Grooper, including the Pattern property of Data Types, Data Formats, and any Internal Extractors.

For this example, a Content Model named "Paragraph Marking" was added to the Node Tree with its Local Resources Folder and Data Model.

1. Add a Data Type to the Local Resources Folder and name it "Paragraphs".

2. Add a Data Format to the "Paragraphs" extractor and name it "Detection Settings".

3. For the value pattern, enter [^\r\n]+. This matches each run of characters up to the next \r\n pair. For now, this captures every single line of text. We will use this pattern to visualize paragraph marking, to see if Grooper successfully removes all \r\n pairs except the one at the end of each paragraph. Notice that, without Paragraph Marking enabled, this pattern returns each line of text.

4. To enable Paragraph Marking, switch to the Properties tab.

5. Expand the Preprocessing Options property.

6. Select the Paragraph Marking property and change it from Disabled to Enabled. (You may do this by either selecting Enabled from the dropdown list or double-clicking the Paragraph Marking property.)
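To see what the value pattern from step 3 returns with the property Disabled and Enabled, the same pattern can be tried in Python. The two sample strings are hypothetical stand-ins for a document's text data before and after Paragraph Marking, and Python's re module is used here in place of Grooper's pattern editor.

```python
import re

pattern = re.compile(r"[^\r\n]+")  # the value pattern from step 3

# Hypothetical two-paragraph sample, before and after Paragraph Marking.
disabled = "First paragraph,\r\nline two.\r\nSecond paragraph,\r\nline two.\r\n"
enabled = "First paragraph, line two.\r\nSecond paragraph, line two.\r\n"

# Disabled: one result per line of text.
print(pattern.findall(disabled))
# Enabled: one result per paragraph.
print(pattern.findall(enabled))
```

With the property Disabled the pattern returns four results, one per line. With it Enabled the pattern returns two results, one per paragraph.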