Layout Data (Concept)

From Grooper Wiki

Revision as of 10:15, 10 May 2024

STUB

This article is a stub. It contains minimal information on the topic and should be expanded.

Layout Data refers to visual information that certain Grooper IP Commands collect, such as lines, checkboxes, barcodes, and detected shapes. This data is stored in a "Grooper.Layout.json" file attached to Batch Pages. Layout data is used by certain extractors and other features that rely on the presence of that data to function.

Glossary

Barcode Detection: Barcode Detection is an IP Command that detects and reads barcode data. The detected barcode information is stored as part of the page's layout data.

Batch Folder: The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch Page: Batch Page nodes represent individual pages within a Batch. Batch Pages are created in one of two ways: (1) when images are scanned into a Batch using the Scan Viewer, or (2) when they are split from a PDF or TIFF file using the Split Pages activity.

  • Batch Pages are frequently referred to simply as "pages".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Box: Box is a connection option for CMIS Connections. It connects Grooper to the Box content management system for import and export operations.

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Find Barcode: Find Barcode is a Value Extractor that searches for and returns barcode values previously stored in a Batch Folder or Batch Page's layout data.

  • Note: Find Barcode differs slightly from Read Barcode. Read Barcode performs barcode recognition when the extractor executes. Find Barcode can only look up barcode data stored in the document or page's layout data. Find Barcode runs quicker than Read Barcode, but barcode values must have previously been collected in the Batch Process by the Image Processing or Recognize activities.

Grooper Repository: A Grooper Repository is the environment used to create, configure and execute objects in Grooper. It provides the framework to "do work" in Grooper. Fundamentally, a Grooper Repository is a connection to a database and file store location, which store the node configurations and their associated file content. The Grooper application interacts with the Grooper Repository to automate tasks and provide the Grooper user interface.

Image Processing: Image Processing is an Activity that enhances Batch Page images and optimizes them for better OCR text recognition and data extraction results.

IP Command: IP Commands specify an image processing (IP) operation (such as image cleanup, format conversion or feature detection) and are used to construct IP Steps in an IP Profile. IP Commands are configured using an IP Step's Command property.

IP Profile: IP Profiles are a step-by-step list of image processing operations (IP Commands). They are used for several image processing related operations, but primarily for:

  1. Permanently enhancing an image during the Image Processing activity (usually to get rid of defects in a scanned image, such as skewing or borders).
  2. Cleaning up an image in-memory during the Recognize activity without altering the image to improve OCR accuracy.
  3. Computer vision operations that collect layout data (table line locations, OMR checkboxes, barcode values, and more) utilized in data extraction.

Key-Value Pair: Key-Value Pair is a Collation Provider option for Data Type extractors. Key-Value Pair matches instances where a key is paired with a value on the document in a specific layout. Note: Key-Value Pair is an older technique in Grooper. In most cases, the Labeled Value extractor is preferable to Key-Value Pair collation.

Layout Data: Layout Data refers to visual information that certain Grooper IP Commands collect, such as lines, checkboxes, barcodes, and detected shapes. This data is stored in a "Grooper.Layout.json" file attached to Batch Pages. Layout data is used by certain extractors and other features that rely on the presence of that data to function.

Line Removal: Line Removal is an IP Command that locates and removes horizontal and vertical lines from documents. The detected line locations are stored as part of the page's layout data.

Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized, in order to be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

Recognize: Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. When properly configured with an OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.

Shape Detection: Shape Detection is an IP Command that locates shapes on a document that match one or more sample images. Common shapes targeted by this command are stamps, seals, logos or other graphical marks that can serve as triggers for document separation or anchors for data extraction. The detected shapes' locations are stored as part of the page's layout data.

Shape Removal: Shape Removal is an IP Command that detects and removes shapes from documents. Common shapes targeted by this command are stamps, seals, logos or other graphical marks that interfere with OCR and/or can serve as triggers for document separation or anchors for data extraction. The detected shapes' locations are stored as part of the page's layout data.

Tab Marking: Tab Marking allows you to insert tab characters into a document's text data.

Table Extract Method: A Table Extract Method defines the settings and logic for a Data Table to perform extraction. It is set by configuring the Extract Method property of the Data Table.

Table Extraction: "Table Extraction" refers to Grooper's ability to extract data from cells in tables on documents. This is accomplished by configuring the Data Table and its child Data Column elements in a Data Model.

About

The following IP Commands create layout data:

  • Line Detection and Line Removal
  • Box Detection and Box Removal
  • Barcode Detection and Barcode Removal
  • Shape Detection and Shape Removal

Any execution of an IP Profile with one or more of these commands will collect and store layout data. IP Profiles can be executed in one of two ways:

  1. The Image Processing activity.
  2. The Recognize activity.

The Image Processing activity will permanently alter a document's image. An IP Profile executed during Recognize will temporarily alter the document's image to clean it up before OCR, then revert back to the original image.

However, in either case, the layout data is collected and stored as a "LayoutData.json" file.

In most cases, these activities are applied at the page level, storing the "LayoutData.json" file on the page object. However, in cases where Recognize is run at the folder level, that file will be stored on the folder object.

If this information is used during data extraction and, for whatever reason, layout data was collected at both the folder level and the page level, Grooper will always prioritize the layout data at the folder level. If the layout data on the folder differs from the layout data on the pages, Grooper will ignore the pages' layout data and use what is on the folder.
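
The precedence rule described above can be sketched in a few lines of Python. This is only an illustration of the stated behavior, not Grooper's implementation: the function name effective_layout and the idea of representing each layout file as a parsed dictionary (or None when the file is absent) are assumptions made for the example.

```python
from typing import Optional


def effective_layout(folder_layout: Optional[dict],
                     page_layout: Optional[dict]) -> Optional[dict]:
    """Pick the layout data to use, following the folder-first rule.

    Hypothetical helper: each argument is the parsed content of a layout
    file, or None if no file exists at that level.
    """
    # Folder-level layout data always wins when it exists,
    # even if it differs from the page-level data.
    if folder_layout is not None:
        return folder_layout
    # Otherwise fall back to whatever the page has (possibly None).
    return page_layout


folder = {"Barcodes": []}
page = {"Barcodes": [{"Value": "12345"}]}
print(effective_layout(folder, page))   # folder data is used
print(effective_layout(None, page))     # page data is used
```

Note that in this sketch an empty folder-level file still takes priority over a populated page-level file, which matches the "ignore the page's layout data" wording above.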


Use Cases

WIP

This section needs expansion.

Examples in LayoutData.json

Once captured by the Image Processing or Recognize activities, Layout Data is saved as a .json file named "LayoutData.json" in the Grooper Repository as a companion file to the processed Batch Folder or Batch Page. This file can be viewed by navigating to the Batch Folder or Batch Page object in the Node Tree, navigating to the "Advanced" tab, and examining the Files for that object. Currently there are four pieces of information that can be stored in this file, though this is likely to grow over time: Lines, OMR Checkboxes, Barcodes and Shapes.

Lines

Lines listed in the LayoutData.json file are broken out into Horizontal Lines and Vertical Lines. For each line, X/Y coordinates will be listed for the start and stop points, noted as "ptA" and "ptB". The X/Y coordinates are measured in inches, relative to the top-left corner of the image, which is known as point 0,0. Below is an example of a Horizontal Line in the LayoutData.json file:

 "HorizontalLines": [
   {
     "X1": 0.1867,
     "X2": 8.1533,
     "Y1": 0.2233,
     "Y2": 0.2233
   }
 ]
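
Because line coordinates are stored in inches, a consumer of this file usually has to convert them to pixels for a given image resolution. The sketch below shows one way to do that with Python's standard json module. It assumes the X1/X2/Y1/Y2 key names shown in the example above and wraps the fragment in braces so it parses as a complete JSON document; the helper name line_to_pixels and the 300 DPI figure are illustrative choices, not Grooper settings.

```python
import json

# Sample data copied from the example above, wrapped in braces (assumption)
# so that it is a complete JSON document.
SAMPLE = """
{
  "HorizontalLines": [
    {"X1": 0.1867, "X2": 8.1533, "Y1": 0.2233, "Y2": 0.2233}
  ]
}
"""


def line_to_pixels(line: dict, dpi: int) -> tuple:
    """Convert a line's inch coordinates to integer pixel coordinates.

    Returns (x1, y1, x2, y2) measured from the top-left corner (0,0).
    """
    return (
        round(line["X1"] * dpi),
        round(line["Y1"] * dpi),
        round(line["X2"] * dpi),
        round(line["Y2"] * dpi),
    )


line_layout = json.loads(SAMPLE)
for line in line_layout["HorizontalLines"]:
    length_inches = line["X2"] - line["X1"]
    print(f"length: {length_inches:.4f} in")
    print("pixels at 300 DPI:", line_to_pixels(line, 300))
```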


OMR Checkboxes

Each OMR checkbox identified will be stored in the LayoutData.json file with X/Y coordinates for the bounding rectangle and an IsChecked boolean flag. The X/Y coordinates denote the location of the top-left and bottom-right corners of the corresponding OMR checkbox. The IsChecked flag indicates whether the corresponding OMR checkbox is filled in (checked). Below is an example of an OMR checkbox in the LayoutData.json file:

 "OmrBoxes": [
   {
     "Bounds": {
       "X1": 1.6467,
       "X2": 1.7567,
       "Y1": 1.3533,
       "Y2": 1.4633
     },
     "IsChecked": false
   }
 ]
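
As a rough illustration of how this structure can be read, the following Python sketch parses the checkbox entries and reports each box's size and checked state. The key names come from the example above; the wrapping braces, the helper name summarize_boxes, and the rounding to four decimal places are assumptions made for the example.

```python
import json

# Sample data copied from the example above, wrapped in braces (assumption)
# so that it is a complete JSON document.
SAMPLE = """
{
  "OmrBoxes": [
    {
      "Bounds": {"X1": 1.6467, "X2": 1.7567, "Y1": 1.3533, "Y2": 1.4633},
      "IsChecked": false
    }
  ]
}
"""


def summarize_boxes(layout: dict) -> list:
    """Return width/height (in inches) and checked state for each OMR box."""
    summary = []
    for box in layout.get("OmrBoxes", []):
        bounds = box["Bounds"]
        summary.append({
            # X1/Y1 is the top-left corner, X2/Y2 the bottom-right corner.
            "width": round(bounds["X2"] - bounds["X1"], 4),
            "height": round(bounds["Y2"] - bounds["Y1"], 4),
            "checked": box["IsChecked"],
        })
    return summary


omr_layout = json.loads(SAMPLE)
omr_summary = summarize_boxes(omr_layout)
print(omr_summary)
```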


Barcodes

Each Barcode identified will be stored in the LayoutData.json file with information regarding the barcode type (symbology), X/Y coordinates for the bounding rectangle, checksum validation flag, confidence score, orientation, and value. The BarcodeType is an integer which represents a corresponding barcode symbology. As an example, a BarcodeType of 8 indicates that this Barcode uses the Code 128 symbology. The ChecksumIsValid flag is set to True if the barcode contains a valid checksum. The Orientation indicates the read direction of the barcode. As an example, an Orientation of 1 indicates the barcode was read in East orientation. The Value is the value read from the barcode. Below is an example of a Barcode in the LayoutData.json file:

 "Barcodes": [
   {
     "BarcodeType": 8,
     "Bounds": {
       "X1": 1.8467,
       "X2": 4.5167,
       "Y1": 1.0833,
       "Y2": 1.9333
     },
     "ChecksumIsValid": false,
     "Confidence": 1,
     "Orientation": 1,
     "Value": "Dummy Value Read from Barcode"
   }
 ]
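
The integer codes in a barcode entry can be translated into readable labels when inspecting the file. The Python sketch below does this for the only two mappings stated above (BarcodeType 8 is Code 128, Orientation 1 is East); all other codes are reported as "unknown" rather than guessed. The helper names, the wrapping braces around the sample, the 0 to 1 confidence scale, and the 0.9 threshold in trusted_values are assumptions for illustration only.

```python
import json

# Sample data copied from the example above, wrapped in braces (assumption)
# so that it is a complete JSON document.
SAMPLE = """
{
  "Barcodes": [
    {
      "BarcodeType": 8,
      "Bounds": {"X1": 1.8467, "X2": 4.5167, "Y1": 1.0833, "Y2": 1.9333},
      "ChecksumIsValid": false,
      "Confidence": 1,
      "Orientation": 1,
      "Value": "Dummy Value Read from Barcode"
    }
  ]
}
"""

# Only the mappings stated in this article; other codes are not known here.
BARCODE_TYPES = {8: "Code 128"}
ORIENTATIONS = {1: "East"}


def describe_barcode(barcode: dict) -> dict:
    """Translate the integer codes of one barcode entry into labels."""
    return {
        "symbology": BARCODE_TYPES.get(barcode["BarcodeType"], "unknown"),
        "orientation": ORIENTATIONS.get(barcode["Orientation"], "unknown"),
        "value": barcode["Value"],
        "checksum_ok": barcode["ChecksumIsValid"],
    }


def trusted_values(layout: dict, min_confidence: float = 0.9) -> list:
    """Hypothetical filter: keep values with a valid checksum and high confidence."""
    return [
        b["Value"]
        for b in layout.get("Barcodes", [])
        if b["ChecksumIsValid"] and b["Confidence"] >= min_confidence
    ]


barcode_layout = json.loads(SAMPLE)
descriptions = [describe_barcode(b) for b in barcode_layout["Barcodes"]]
print(descriptions)
print(trusted_values(barcode_layout))
```

With the sample entry, trusted_values returns an empty list because ChecksumIsValid is false; whether a real workflow should filter on that flag depends on the symbology in use.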

Shapes

WIP

This section needs expansion.