2021:Vertical Wrap Detection

From Grooper Wiki



Revision as of 13:46, 10 May 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.



Glossary

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Behavior: A "Behavior" is one of several features applied to a Content Type (such as a Document Type). Behaviors affect how certain Activities and Commands are executed, based on how a document (Batch Folder) is classified. These Activities and Commands behave differently according to the document's Document Type. This includes how documents are exported (how Export behaves), if and how they are added to a document search index (how the various indexing commands behave), and if and how Label Sets are used (how Classify and Extract behave in the presence of Label Sets).

  • Each Behavior is enabled by adding it to a Content Type. They are configured in the Behaviors editor.
  • Behaviors extend to descendant Content Types, if the descendant Content Type has no Behavior configuration of its own.
    • For example, all Document Types will inherit their parent Content Model's Behaviors.
    • However, if a Document Type has its own Behavior configuration, it will be used instead.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Content Type: Content Types are a class of node types used to classify Batch Folders. They represent categories of documents (Content Models and Content Categories) or distinct types of documents (Document Types). Content Types serve an important role in defining Data Elements and Behaviors that apply to a document.

Data Field: Data Fields represent a single value targeted for data extraction on a document. Data Fields are created as child nodes of a Data Model and/or Data Sections.

  • Data Fields are frequently referred to simply as "fields".

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Extractor Type:

Labeling Behavior: A Labeling Behavior extends "label set" functionality to Document Types. This allows you to collect field labels and other labels present on a document and use them in a variety of ways. This includes functionality for classification, field extraction, table extraction, and section extraction.

List Match: List Match is a Value Extractor designed to return values matching one or more items in a defined list. By default, the List Match extractor does not use or require regular expressions, but it can be configured to use regular expression syntax.

Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

Ordered Array: Ordered Array is a Collation Provider option for Data Type extractors. Ordered Array finds sequences of values where one result is present for each extractor, in the order they appear, according to a specified horizontal, vertical or text-flow layout.

Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions.)

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.

Vertical Wrap:

About

You may download and import the file below into your own Grooper environment (version 2021). This contains the Batch(es) with the example document(s) discussed in this article and the Content Model(s) configured according to its instructions.


Stacked labels are simply multi-word labels whose words are aligned vertically on multiple lines. In other words, they are "stacked" on top of each other. You can contrast this with simple labels which appear on a single line of the document.

In the before times (before version 2021), stacked labels presented something of a challenge. For simple labels, the approach is, well, simple. We use a regular expression to match the label. Do you want to match the label "ZIP CODE"? Your regex pattern is simply ZIP CODE.

However, for stacked labels, it's a little trickier. A regular expression is matched against the entire document as one big text string. By itself, it can't match labels stacked on top of each other because it just matches against the text flow character by character, and in the text flow the words of a stacked label are typically separated by a line break and whatever other text sits between them.
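To make the limitation concrete, here is a minimal Python sketch using the standard `re` module. It is not Grooper code, and the two sample strings are made-up stand-ins for a page's text flow, but it shows why the plain pattern ZIP CODE finds the simple label and misses the stacked one.

```python
import re

# Hypothetical text flow for a simple label: both words share one line.
simple_text = "ZIP CODE 73102"

# Hypothetical text flow for a stacked label: "ZIP" sits above "CODE",
# so other text from the first line falls between them in reading order.
stacked_text = "CUSTOMER    ZIP\nNAME        CODE\nAcme Corp   73102"

pattern = re.compile(r"ZIP CODE")

# The pattern is matched character by character against each string.
simple_match = pattern.search(simple_text)
stacked_match = pattern.search(stacked_text)

print("simple label matched:", simple_match is not None)
print("stacked label matched:", stacked_match is not None)
```

In the stacked sample, the characters between "ZIP" and "CODE" are a line break plus the word "NAME" and its padding, so the literal pattern never lines up.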

Instead, we had to use a Data Type, collated as an Ordered Array in the Vertical Layout mode. Each line of the stacked label was one of the array elements, and we usually specified some minimum distance between the words in the label to throw out false positive results.

You can see here an example of how this was done.

  1. This is the parent Data Type (also the object we have selected in the Node Tree).
  2. The two child extractors return the results of each line.
  3. The Data Type is configured to use the Ordered Array option for its Collation, enabling Vertical Layout mode.
  4. The Data Type returns the label, looking for the word "ZIP" stacked on top of "CODE".
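The idea behind that older setup can be sketched outside of Grooper. The Python below is only an illustration of matching one word per line in top-to-bottom order. The `Word` class, the `find_stacked_label` function, and the `max_gap` parameter are hypothetical names, the coordinates are invented, and treating the distance setting as an upper bound on the vertical gap is an assumption rather than a description of how Ordered Array is implemented.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Word:
    """A word with its bounding box (hypothetical; top grows downward)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def horizontal_overlap(a: Word, b: Word) -> float:
    """Width shared by two boxes; zero or negative means no overlap."""
    return min(a.right, b.right) - max(a.left, b.left)


def find_stacked_label(words: Sequence[Word], lines: Sequence[str],
                       max_gap: float) -> Optional[List[Word]]:
    """Find one word per label line, each directly below the previous one."""

    def extend(chain: List[Word], remaining: Sequence[str]) -> Optional[List[Word]]:
        if not remaining:
            return chain
        prev = chain[-1]
        for candidate in words:
            gap = candidate.top - prev.bottom
            if (candidate.text == remaining[0]
                    and 0 <= gap <= max_gap
                    and horizontal_overlap(prev, candidate) > 0):
                found = extend(chain + [candidate], remaining[1:])
                if found is not None:
                    return found
        return None

    for first in words:
        if lines and first.text == lines[0]:
            found = extend([first], list(lines[1:]))
            if found is not None:
                return found
    return None


# Invented page: one stray "CODE" far away, plus "ZIP" stacked on "CODE".
page_words = [
    Word("CODE", 400, 300, 440, 310),
    Word("ZIP", 100, 50, 130, 60),
    Word("CODE", 98, 62, 135, 72),
]
label = find_stacked_label(page_words, ["ZIP", "CODE"], max_gap=5.0)
print([w.text for w in label] if label else "no match")
```

The distance check is what throws out the stray "CODE" further down the page, which is the role the distance setting played in the Data Type configuration.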


Seems like a lot of work to find the label "ZIP CODE", right?

Starting in version 2021, there is a much easier way of doing this through the Vertical Wrap property.

Currently, the Vertical Wrap property is accessible at two points in Grooper.

  1. When using the List Match Extractor Type.
  2. When collecting labels for Document Types utilizing a Labeling Behavior.

Vertical Wrap and List Match

Anywhere you can use the List Match Extractor Type, you can enable vertical wrapping.

  1. Here, we've created and selected a Value Reader.
  2. We've set its Extractor Type to List Match.
  3. We have a single label in our Local Entries list of labels, ZIP CODE.
  4. As you can see, it returns the simple label.
  5. However, it does not return the stacked label yet.

We can get both the simple and stacked label to match using the Vertical Wrap property. For the List Match Extractor Type, vertical wrapping is enabled using the Vertical Wrap property in the "Properties" tab.

  1. Navigate to the "Properties" tab.
  2. Change the Vertical Wrap property from Disabled to Enabled.
    • This property is found under the Options property heading.
  3. Now both the simple and stacked labels are matched and returned!

Furthermore, the Vertical Wrap property will match multiple variations of a wrapped label.

Take, for example, the label "Purchase Order Number:". With Vertical Wrap enabled, this List Match extractor matches:

  1. The simple label on a single line.
  2. Both varieties of the stacked label on two lines.
  3. And the stacked label on three lines.

All of this is done just by enabling the Vertical Wrap property, using its default configuration.
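The four variations listed above are exactly the ways a three-word label can break across lines. As a rough illustration only, the Python sketch below enumerates them. `wrap_variants` is a hypothetical helper, not a Grooper function, and it says nothing about how Vertical Wrap actually finds its matches.

```python
from itertools import product
from typing import List


def wrap_variants(label: str) -> List[List[str]]:
    """List every way the label's words can be split into stacked lines."""
    words = label.split()
    if not words:
        return []
    variants = []
    # Each gap between two words is either a space (False) or a line break (True).
    for breaks in product([False, True], repeat=len(words) - 1):
        lines = [words[0]]
        for word, is_break in zip(words[1:], breaks):
            if is_break:
                lines.append(word)        # start a new stacked line
            else:
                lines[-1] += " " + word   # stay on the current line
        variants.append(lines)
    return variants


variants = wrap_variants("Purchase Order Number:")
for lines in variants:
    print(" / ".join(lines))
```

For a label with n words there are 2^(n-1) such layouts, which for "Purchase Order Number:" gives one single-line form, two two-line forms, and one three-line form.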

Labeling Behavior and Vertical Wrap

FYI: Labeling Behavior is a Content Type Behavior that uses a document's labels for a variety of document processing purposes. For more information on Labeling Behavior functionality, visit the Label Sets article.

Vertical Wrap is enabled by default when adding the Labeling Behavior to a Content Model.

  1. Here, we've added the Labeling Behavior to our Content Model using its Behaviors property.
  2. As you can see, the Vertical Wrap property is Enabled by default.

  1. This allows for the simple collection of stacked labels, such as the label for this "PO Number" Data Field.
  2. The label is stacked on the document, "PO" on top of "Number".
  3. But the label is matched successfully.
    • It is both highlighted on the document viewer and green in the label collection editor.

  1. Were the Vertical Wrap property to be disabled (by setting it from Enabled to Disabled), the label would no longer match.

  1. Now the label no longer matches. It is highlighted red in the label collection editor and is not highlighted on the document viewer.
    • The Vertical Wrap property is extremely useful when collecting stacked labels.
FYI

Note the Layout property is set to Simple.

Vertical Wrap was designed to work well with certain label layouts (notably, Simple and Ruled Layout options). Typically, the Simple Layout will work best when matching stacked labels. It is possible a different layout could have been used to force a match with Vertical Wrap disabled. However, that is not the intended use of the other layout options. For more information on label layouts, visit the Label Sets article.

Furthermore, Vertical Wrap should only be disabled in extremely rare circumstances: when vertical wrapping produces false positive matches for labels, and the false positives cannot be resolved by adjusting the Vertical Wrap settings or by other methods.