2.80:Paragraph Marking (Property): Difference between revisions

Revision as of 12:48, 20 April 2020

Paragraph Marking alters the normal text data in a document by placing the carriage return and new line feed pairs at the end of each paragraph, instead of at the end of each line. This allows users to break up a document's text flow into paragraph segments instead of line segments.

The Paragraph Marking property is enabled in the Preprocessing Options of the Pattern Editor. There are several paragraph detection settings to determine what qualifies as a paragraph.

About

Paragraph Marking is part of Grooper's Natural Language Processing (NLP) solution. Normally, after text data is obtained from the Recognize activity, the carriage return and new line feed characters \r\n are inserted at the end of each line. For structured documents, such as forms and reports, this is extremely useful. Information is typically conveyed line by line. These characters can be helpful anchors to locate and parse data.

For structured documents, like this report, important information is laid out line by line.

However, unstructured documents, such as contracts and correspondence, convey information differently. Instead of information being broken up line by line, it is broken up paragraph by paragraph.

For unstructured documents, like this contract, the same information is there. However, it is embedded in paragraphs.

Paragraph Marking allows Grooper to alter the \r\n pairs within the text data. Instead of being placed at the end of each line, they are placed at the end of each paragraph.

Instead of \r\n pairs marking the end of each line...	...they mark the end of each paragraph.

Enabling the Paragraph Marking property first detects paragraphs in a document, then removes all \r\n pairs except the pair at the last line of each paragraph. Furthermore, the removed \r\n pairs are replaced with a single space character to keep the normal text flow intact. That way the word at the end of a line will have a space between it and the word at the beginning of the next line.
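
The replacement step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Grooper's implementation: the function name mark_paragraphs and the ends_with_period heuristic are hypothetical, and the real paragraph detection settings are configured in the Pattern Editor rather than supplied as a function.

```python
def mark_paragraphs(text, is_paragraph_end):
    """Keep \\r\\n only at paragraph ends; replace every other pair with a space.

    `is_paragraph_end` is a caller-supplied predicate (an assumption made for
    this sketch) that decides whether a line is the last line of a paragraph.
    """
    # Split on the line-level \r\n pairs; drop the empty piece after a trailing pair.
    lines = text.split("\r\n")
    if lines and lines[-1] == "":
        lines.pop()

    pieces = []
    for index, line in enumerate(lines):
        is_last_line = index == len(lines) - 1
        if is_last_line or is_paragraph_end(line):
            pieces.append(line + "\r\n")  # keep the pair at the paragraph's last line
        else:
            pieces.append(line + " ")     # a single space keeps the text flow intact
    return "".join(pieces)


def ends_with_period(line):
    # Naive stand-in for real paragraph detection: a line that ends with a
    # period is treated as the last line of its paragraph.
    return line.rstrip().endswith(".")


sample = (
    "This Agreement is entered into by\r\n"
    "the parties listed below.\r\n"
    "Payment is due within thirty\r\n"
    "days of invoice.\r\n"
)
marked = mark_paragraphs(sample, ends_with_period)
```

With the sample above, the four line-level \r\n pairs are reduced to two, one at the end of each paragraph, and each removed pair becomes a single space.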

This allows you to do two very important things when extracting data from unstructured documents.

1)

@@ Line 8: / Line 8: @@
 Paragraph Marking is part of Grooper's Natural Language Processing (NLP) solution.  Normally, after text data is obtained from the [[Recognize]] activity, the carriage return and new line feed characters <code>\r\n</code> are inserted at the end of each line.  For structured documents, such as forms and reports, this is extremely useful.  Information is typically conveyed line by line.  These characters can be helpful anchors to locate and parse data.
+{|style="margin:auto" cellpadding=10
+|+For structured documents, like this report, important information is laid out line by line.
+[[File:Paragraph 1.png|center|border]]
+|}
 However, unstructured documents, such as contracts and correspondence, convey information differently.  Instead of information being broken up line by line, it is broken up ''paragraph by paragraph''.
+{|style="margin:auto" cellpadding=10 cellspacing=5
+|+For unstructured documents, like this contract, the same information is there.  However, it is embedded in paragraphs.
+[[File:Paragraph 2.png|center|border]]
+|}
+'''''Paragraph Marking''''' allows Grooper to alter the <code>\r\n</code> pairs within the text data.  Instead of being placed at the end of each line, they are placed at the end of each paragraph.
+{|style="margin:auto; text-align:center" cellpadding=10 cellspacing=5
+|Instead of \r\n pairs marking the end of each line...||...they mark the end of each paragraph.
+|-
+|[[File:Paragraph 3.png|center]]||[[File:Paragraph 4.png|center]]
+|}
+Enabling the '''''Paragraph Marking''''' property first detects paragraphs in a document, then removes all <code>\r\n</code> pairs ''except'' the pair at the last line of each paragraph.  Furthermore, the removed <code>\r\n</code> pairs are replaced with a single space character to keep the normal text flow intact.  That way the word at the end of a line will have a space between it and the word at the beginning of the next line.
+This allows you to do two very important things when extracting data from unstructured documents.
+)