Glossary: Difference between revisions

From Grooper Wiki

Revision as of 14:00, 24 April 2024

Activity

Activity is a property on Batch Process Step objects. Activities define specific document processing operations done to a Batch, Batch Folder, or Batch Page.

Batch Process Steps configured with specific Activities are frequently referred to by the name of the Activity followed by the word "step". For example: Classify Step.


Classify

Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Content Type using patterns, lexical understanding, or rules as defined by a Content Model.


Clip Frames

The Clip Frames Activity extracts defined areas from microfiche card images, creating new image frames or layers for focused analysis or processing.


Detect Frames

The Detect Frames Activity locates and identifies frame lines on microfiche card images, enabling the isolation of areas within the frames for further data extraction or processing.


Execute

The Execute Activity runs a specified child command, allowing for the modular and controlled execution of tasks within a larger automated workflow.


Export

The Export Activity facilitates the transfer of documents and extracted information to external systems or formats, completing the data processing workflow.


Extract

The Extract Activity retrieves relevant information, defined by Data Elements, from Batch Folders, transforming unstructured or semi-structured content into structured, usable data.


Image Processing

The Image Processing Activity enhances and optimizes Batch Pages for better recognition and data extraction results.


Initialize Card

The Initialize Card Activity prepares and configures microfiche card images for further processing.


Recognize

The Recognize Activity interprets Batch Pages and Batch Folders, converting them into machine-readable text and capturing layout data for comprehensive analysis and data extraction. This will attach a text and/or layoutData file to the respective object.


Render

The Render Activity converts electronic document content from file formats Grooper cannot read natively into PDF format. This allows Grooper to extract the text via the Recognize Activity.


Review

The Review Activity facilitates human evaluation and validation of processed Batch Folders and extracted data for accuracy and completeness.


Send Mail

The Send Mail Activity automates the dispatch of emails with or without attachments, based on Batch Process events and conditions.


Separate

The Separate Activity sorts Batch Pages into individual Batch Folders, distinguishing them for independent processing and organization.


Split Pages

Multi-page documents (typically PDFs and TIFFs) come into Grooper represented as single Batch Folders. The Split Pages Activity exposes Batch Pages as child objects of the Batch Folders for individualized processing and handling.


XML Transform

The XML Transform Activity applies XSLT stylesheets to XML data to modify or reformat the output structure for various purposes.


Application

A Grooper repository consists of a series of tables in a database and a File Store containing the files associated with objects that exist within that database. A Grooper application is the interface through which a user can interact with that repository of information in an intuitive way.

Grooper Command Console

The Grooper Command Console is a command-line interface that performs system configuration and administration tasks within Grooper.

Web Client

The Grooper Web Client allows users to connect to Grooper via a web browser using a URL. The URL points to a website hosted by a server on which Grooper is installed and Internet Information Services (IIS) is configured.

Behavior

Behaviors is a property of Content Types and Export Activities that defines configurable actions that automate processing tasks based on the identified Content Type of a Batch Folder.


Export Behavior

An Export Behavior defines the conditions and actions for exporting Batch Folders and their associated data from Grooper to other systems.


Labeling Behavior

A Labeling Behavior is a Content Type Behavior designed to collect and utilize a document's field labels in a variety of ways. This includes functionality for Classification and Extraction.


PDF Data Mapping

PDF Data Mapping is a Content Type Behavior designed to create an exportable PDF file with additional native PDF elements.


CMIS Connection Type

CMIS Connection Type, or "binding", establishes the communication protocols used to connect Grooper with content management systems adhering to the CMIS standard.


AppXtender

The AppXtender CMIS Connection Type, or "binding", connects Grooper to the ApplicationXtender content management system for import and export operations.


Box

The Box CMIS Connection Type, or "binding", connects Grooper to the Box content management system for import and export operations.


Exchange

The Exchange CMIS Connection Type, or "binding", connects Grooper to the Microsoft Exchange Server mail server for import and export operations.


FTP

The FTP CMIS Connection Type, or "binding", connects Grooper to FTP directories for import and export operations.


IMAP

The IMAP CMIS Connection Type, or "binding", connects Grooper to email messages and folders through an IMAP email server.


NTFS

The NTFS CMIS Connection Type, or "binding", connects Grooper to files and folders in the Microsoft Windows NTFS file system.


OneDrive

The OneDrive CMIS Connection Type, or "binding", connects Grooper to Microsoft OneDrive cloud services.


SFTP

The SFTP CMIS Connection Type, or "binding", connects Grooper to SFTP directories for import and export operations.


SharePoint

The SharePoint CMIS Connection Type, or "binding", connects Grooper to Microsoft SharePoint, providing access to content stored in "document libraries" and "picture libraries".


Classification Method

The Classification Method property determines the technique used for document classification within a Content Model, enabling the sorting of Batch Folders into categories based on their content or structure. It can utilize pattern matching, machine learning models, or other methodologies to identify and organize documents accurately.


Labelset-Based

Labelset-Based is a Classification Method that leverages the labels defined via a Labeling Behavior to classify Batch Folders.


Lexical

The Lexical Classification Method classifies Batch Folders based on their text content by utilizing either pre-configured training or rules. This is achieved through the analysis of word frequencies or defined rules that identify document types.


Rules-Based

The Rules-Based Classification Method applies "rules" defined on Document Types to classify Batch Folders. It uses the Positive Extractor and Negative Extractor properties to determine whether a Batch Folder matches the predefined criteria for a Document Type.


Visual

The Visual Classification Method uses image data instead of text data to determine the Document Type assigned to a Batch Folder during classification. Instead of using text-based extractors, an IP Profile is used with an Extract Features IP Command to obtain data pertaining to a Batch Folder's image(s). Document samples are trained as examples of a Document Type.


Collation Provider

The Collation Provider property of a Data Type defines the method for converting its raw results into a final result set, governing how lists of matches from the Data Type are combined and interpreted to produce the output data of the Data Type.


AND

The AND Collation Provider of a Data Type returns results only when each individual extractor specified within it gets at least one hit, thus acting as a logical “AND” operator across multiple extractors.
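
The "every extractor must hit" behavior can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `and_collation` is hypothetical, plain regular expressions stand in for extractors, and the way hits are combined into the output list is an assumption rather than Grooper's actual result format.

```python
import re

def and_collation(text, patterns):
    """Return every hit, but only if each pattern matches at least once.

    Assumption: each pattern stands in for one extractor and contains no
    capture groups, so re.findall returns the full matched strings.
    """
    hits_per_extractor = [re.findall(p, text) for p in patterns]
    # An empty hit list is falsy, so all() fails if any extractor misses.
    if all(hits_per_extractor):
        return [hit for hits in hits_per_extractor for hit in hits]
    return []

extractors = [r"Invoice", r"\d{2}/\d{2}/\d{4}"]
with_both = and_collation("Invoice dated 04/24/2024", extractors)
missing_one = and_collation("Invoice with no date", extractors)
```

Here `with_both` contains hits from both patterns, while `missing_one` is empty because the date pattern found nothing.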


Array

The Array Collation Provider of a Data Type matches a list of values arranged in horizontal, vertical, or flow order, combining instances that qualify into a single result.


Combine

The Combine Collation Provider of a Data Type combines instances from returned results based on a specified grouping, controlling how extractor results are assembled together for output.


Key-Value List

The Key-Value List Collation Provider of a Data Type matches instances where a key and a list of one or more values appear together on the document, adhering to a specific layout pattern.


Key-Value Pair

The Key-Value Pair Collation Provider of a Data Type matches instances where a key is paired with a value on the document in a specific layout, essential for extracting label-value pairs.
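
As a rough illustration of label-value pairing, the sketch below pairs a key with a value that follows it on the same line. It is not Grooper's implementation: `key_value_pairs` is a hypothetical helper, the sample text is invented, and only one layout (value to the right of the key) is covered.

```python
import re

def key_value_pairs(text, key_pattern, value_pattern):
    """Pair each key with a value that follows it on the same line.

    Assumption: only a horizontal layout is handled. [ \t]* is used
    instead of \\s* so the match cannot run across a line break.
    """
    pair = re.compile(
        rf"(?P<key>{key_pattern})[ \t]*:?[ \t]*(?P<value>{value_pattern})"
    )
    return [(m.group("key"), m.group("value")) for m in pair.finditer(text)]

page = "Invoice Date: 04/24/2024\nDue Date: 05/24/2024\nNotes: none"
pairs = key_value_pairs(page, r"(?:Invoice|Due) Date", r"\d{2}/\d{2}/\d{4}")
```

The "Notes" line is ignored because it matches neither the key pattern nor the value pattern.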


Ordered Array

The Ordered Array Collation Provider of a Data Type finds sequences of values where one result is present for each extractor, in the order they appear.


Pattern-Based

The Pattern-Based Collation Provider of a Data Type uses regular expressions to sequence returned results into a final result set.


Split

The Split Collation Provider of a Data Type separates a data instance at each match returned by the Data Type.


Concept

There are many objects and properties a user can configure in Grooper. However, knowing how, why, and when to use them depends on understanding the underlying concepts that define what these objects and properties do and why.


Activity Processing

Activity Processing is a conceptual term that refers to the execution of a sequence of configured tasks, such as classification, extraction, or data enhancement on documents, which are performed within a Batch Process to transform raw data from documents into structured and actionable information.


CMIS+

CMIS+ is a conceptual term that refers to Grooper's CMIS+ architecture, which provides standardized access to document content and metadata across a variety of external storage platforms.


CMIS

CMIS is a conceptual term that refers to Content Management Interoperability Services: an open standard allowing different content management systems to share information over the Internet.


CMIS Query

CMIS Query is a conceptual term that refers to the queries used to search documents in CMIS Repositories and to filter documents upon import when using the Import Query Results Import Provider.


CSS Data Viewer Styling

CSS Data Viewer Styling is a conceptual term that refers to the idea that the Grooper Web Client's Data View task view of the Review interface is styled using CSS. This gives you a great deal of control over a Data Model's appearance and layout during document review.


Classification

Classification is a conceptual term that refers to the process of identifying and organizing documents into categorical types based on their content or layout, often using machine learning, rules, or pattern recognition for efficient document management and data extraction workflows. Specifically, the Classify Activity will assign a Content Type to a Batch Folder.


Code Expressions

Code Expressions (not to be confused with regular expressions) is a conceptual term that refers to snippets of VB.Net code that expand Grooper’s core functionality.


Combined Methods

Combined Methods is a conceptual term that refers to the idea that a user can leverage multiple Classification Methods to overcome the shortcomings of an individual method.


Content Type

Content Type is a conceptual term that refers to the grouping of three Grooper objects: Content Models, Content Categories, and Document Types.


Data Context

Data Context is a conceptual term that gives definition to data that, without it, is otherwise meaningless.


Data Element

Data Element is a conceptual term that refers to the grouping of five Grooper objects: Data Models, Data Sections, Data Fields, Data Tables, and Data Columns.


Data Extractor

Data Extractor is a conceptual term that refers to the grouping of all extractor types and extractor objects.


Data Instance

Data Instance is a conceptual term that refers to an encapsulation of text data within a document. Data instances are the hierarchy of text data that Grooper's extraction mechanisms create.


EDI Integration

EDI Integration is a conceptual term that refers to Grooper's ability to process EDI files.


Expressions

Expressions (not to be confused with regular expressions) is a conceptual term that refers to snippets of VB.Net code that expand Grooper’s core functionality.


Expressions Cookbook

Expressions Cookbook is a conceptual term that refers to a reference list for commonly used expressions in Grooper.


Field Mapping

Field Mapping is a conceptual term that refers to how logical connections are made between metadata content in Grooper and an external storage platform.


Five Phases of Grooper

Five Phases of Grooper is a conceptual term that seeks to build understanding of how documents are processed through Grooper.


Flow Collation

Flow Collation is a conceptual term used to define a type of layout used in Collation Providers of Data Types.


Footer Rows and Footer Modes

Footer Rows and Footer Modes is a conceptual term that refers to how a "footer row" (enabled by the Generate Footer Row property of a Data Table) gives Grooper users a quick way to validate numerical data in a Data Column. The Data Column's Footer Mode property controls whether and how a total is determined for numerical values in a Data Column.
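
The validation idea behind a footer row can be sketched as a simple sum check. This is a minimal sketch under assumptions: summing is only one possible way to determine a total, and the function name `column_total_matches` and the sample values are hypothetical, not taken from Grooper.

```python
from decimal import Decimal

def column_total_matches(values, printed_total):
    """Sum extracted column values and compare them with a printed total.

    Assumption: values arrive as clean numeric strings. Decimal is used
    so currency amounts are compared exactly, without float rounding.
    """
    computed = sum((Decimal(v) for v in values), Decimal("0"))
    return computed, computed == Decimal(printed_total)

total, is_valid = column_total_matches(["10.50", "4.25", "0.25"], "15.00")
```

A mismatch between the computed and printed totals would suggest that a value in the column was misread or missed.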


Fuzzy RegEx

Fuzzy RegEx is a conceptual term that refers to the use of fuzzy logic within extractor types that leverage regular expressions to match patterns, enabled via the Fuzzy Matching property.
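
To show why approximate matching helps with OCR errors, here is a small Python sketch. It does not reproduce Grooper's Fuzzy Matching algorithm, which is not described here. Instead it uses the standard library's `difflib.SequenceMatcher` similarity as a stand-in, and the helper `fuzzy_find`, its 0.8 threshold, and the sample text are all assumptions.

```python
from difflib import SequenceMatcher

def fuzzy_find(target, text, min_similarity=0.8):
    """Return (start, substring, similarity) for the best window, or None.

    Slides a window the length of the target across the text and keeps
    the first window with the highest case-insensitive similarity.
    """
    best = None
    width = len(target)
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        score = SequenceMatcher(None, target.lower(), window.lower()).ratio()
        if score >= min_similarity and (best is None or score > best[2]):
            best = (start, window, score)
    return best

# "I" misread as "l" is a typical OCR confusion.
hit = fuzzy_find("Invoice Number", "lnvoice Number: 20417")
miss = fuzzy_find("Invoice Number", "Purchase Order 77")
```

An exact search for "Invoice Number" would fail on the misread text, while the similarity-based search still locates the label.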


GPT Integration

GPT Integration is a conceptual term that refers to the usage of OpenAI's GPT models within Grooper to enhance the capabilities of data extractors, classification, and lookups.


Grooper Infrastructure

Grooper Infrastructure is a conceptual term that refers to the computing underpinnings of a Grooper repository and the software used to interface with it.


Grooper Repository

Grooper Repository is a conceptual term that refers to the environment used to create, configure and execute objects in Grooper. It provides the framework to "do work" in Grooper.


Grooper Service

Grooper Service is a conceptual term that refers to the various executable programs that run as Windows Services to facilitate Grooper processing. Service instances are installed, configured, started and stopped using Grooper Config.


Image Processing

Image Processing is a conceptual term that refers to how Grooper applies a variety of techniques to enhance scanned documents' quality, improving OCR accuracy by removing imperfections and adjusting visual characteristics to prepare images for data extraction and classification.


Import Mode and Document Linking

Import Mode and Document Linking is a conceptual term that refers to the usage of the Import Mode property. This affects whether an imported document maintains a link to its original file and whether a copy of the file is made on import.


LINQ to Grooper Objects

LINQ to Grooper Objects is a conceptual term that refers to the ability of Grooper to leverage LINQ syntax in expressions.


Layered OCR

Layered OCR is a conceptual term that refers to the usage of the Layered OCR setting of the OCR Engine property of an OCR Profile. The use of this setting enables the usage of secondary OCR Profiles on a single page. The OCR results from these secondary OCR Profiles are merged with (or layered on top of) the primary OCR Profile's results.


Layout Data

Layout Data is a conceptual term that refers to information such as line locations, OMR checkbox locations and states, barcode values, and detected shapes captured by certain image processing commands. This data is stored as an attached file on a Batch Folder or Batch Page object and can later be recalled by various functions within Grooper that rely on the presence of that data to function.


Microfiche Processing

Microfiche Processing is a conceptual term that refers to how Grooper leverages several IP Commands to accurately process microform documents.


Microsoft Office Integration

Microsoft Office Integration is a conceptual term that refers to Grooper's ability to convert Microsoft Word and Microsoft Excel files into formats that Grooper can read.


OCR

OCR is a conceptual term that stands for Optical Character Recognition. It allows text from paper documents to be digitized so it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.


OCR Synthesis

OCR Synthesis is a conceptual term that refers to Grooper's unique method of pre-processing and re-processing raw results from the OCR Engine to get better results out of it.


Object Nomenclature

Object Nomenclature is a conceptual term that refers to the idea that mastery of a Grooper environment is greatly enhanced by understanding the myriad of objects that can exist and how they are related.


PDF Page Types

PDF Page Types is a conceptual term that refers to specific types of PDF pages. Page types describe the kind of content in a PDF page and inform Grooper how certain Activities should process the page. For example, "single image" pages are OCR'd by the Recognize Activity, whereas "text only" pages have their native text extracted.


Regular Expression

Regular Expression is a conceptual term that refers to a standard syntax designed to parse text strings. This is a way of finding information in a block of text. It is the primary method by which Grooper extracts and returns data from documents.
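
A short Python example shows the general idea of finding information in a block of text with a regular expression. The sample text and pattern are invented for illustration, and Python's `re` syntax is used here as an assumption. The regular expression flavor and options used inside Grooper may differ.

```python
import re

# Hypothetical page text; the pattern and sample are illustrative only.
text = "Invoice No: INV-20417  Date: 04/24/2024  Total: $1,250.00"

# Named groups capture an invoice number and a date in MM/DD/YYYY form.
pattern = re.compile(
    r"Invoice No:\s*(?P<number>INV-\d+).*?Date:\s*(?P<date>\d{2}/\d{2}/\d{4})"
)

match = pattern.search(text)
fields = match.groupdict() if match else {}
```

After this runs, `fields` maps each named group to the text it captured, which is the kind of structured result an extractor would return.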


Repository

Repository is a conceptual term that refers to a location where files and/or data are stored and managed.


Separation

Separation is a conceptual term that refers to the process of taking an unorganized Batch of loose Batch Pages and organizing them into document folders. This is done so Grooper can later assign a Document Type to each document folder in a process known as Classification.
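
The idea of turning loose pages into document folders can be sketched as follows. This is a simplified illustration, not one of Grooper's Separation Providers: the function `separate`, the "first page" pattern, and the sample page text are all hypothetical.

```python
import re

def separate(pages, first_page_pattern):
    """Group page texts into folders, starting a new folder at every
    page whose text matches the first-page pattern.

    Assumption: pages before the first match are kept together in an
    initial folder rather than discarded.
    """
    folders = []
    for page_text in pages:
        if re.search(first_page_pattern, page_text) or not folders:
            folders.append([])
        folders[-1].append(page_text)
    return folders

loose_pages = ["Invoice 1001 page 1", "continued items", "Invoice 1002 page 1"]
document_folders = separate(loose_pages, r"^Invoice \d+")
```

Each inner list represents one document folder, ready to be assigned a Document Type in a later classification step.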


TF-IDF

TF-IDF is a conceptual term that refers to term frequency-inverse document frequency, a numerical statistic intended to reflect how important a word is to a document within a collection (also called a document set or corpus). It is how Grooper uses machine learning for training-based document classification (via the Lexical method) and data extraction (via the Field Class extractor).
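
The statistic itself can be computed in a few lines. The sketch below uses the textbook form, term frequency multiplied by the logarithm of the inverse document frequency. The exact weighting and tokenization Grooper applies are not stated here, so the function `tf_idf`, the whitespace tokenizer, and the sample documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using plain TF-IDF.

    tf  = occurrences of the term / total terms in the document
    idf = log(number of documents / documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "invoice total amount due",
    "invoice number and invoice date",
    "purchase order number",
]
scores = tf_idf(docs)
```

In the first document, "total" receives a higher weight than "invoice" because "invoice" also appears in another document and is therefore less distinctive.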


Table Extraction

Table Extraction is a conceptual term that refers to Grooper's functionality to extract data from cells in tables. This is accomplished by configuring the Data Table and its child Data Column Data Elements in a Data Model.


Test Batch

Test Batch is a conceptual term that refers to any Batch created in the Test folder of the Batches folder in the Node Tree.


Thread

Thread is a conceptual term that refers to the smallest unit of processing that can be performed within an operating system.


Training-Based Approaches to Document Classification

Training-Based Approaches to Document Classification is a conceptual term that refers to an approach to document classification that classifies Batch Folders according to the similarity of unclassified Batch Folders to trained examples of that kind of Document Type.


Training Batch

Training Batch is a conceptual term that refers to a more convenient way to work with all of the samples a Content Model has been trained against. You can also still look at the Form Types underneath each Content Type, but the Training Set can show you all the samples in one place.


UNC Path

UNC Path is a conceptual term that refers to the Universal Naming Convention (UNC), a standard used in Microsoft Windows for accessing shared network folders.
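
A UNC path has the general shape `\\server\share\folder\file`. The sketch below uses Python's standard `pathlib.PureWindowsPath` to take one apart. The server, share, and file names are hypothetical.

```python
from pathlib import PureWindowsPath

# Hypothetical shared folder; PureWindowsPath parses without touching disk.
unc = PureWindowsPath(r"\\fileserver\scans\inbox\batch001.pdf")

server_and_share = unc.drive      # the \\server\share prefix
relative_parts = unc.parts[1:]    # folders and file name below the share
file_name = unc.name
```

For a UNC path, `pathlib` treats the server and share together as the "drive", which mirrors how Windows addresses a shared network folder.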


URL Endpoints for Review

URL Endpoints for Review is a conceptual term that refers to three URL endpoints that can be used to open Review tasks in the Grooper Web Client, given certain information such as the Grooper Repository ID, Batch Process name, Batch ID, and more.


Waterfall Classification

Waterfall Classification is a conceptual term that refers to a classification notion in Grooper that manipulates the Positive Extractor property to prioritize training similarity in order to achieve a middle ground between high specificity and accuracy, and generality with minimal accuracy. This is helpful whenever Batch Folders get misclassified, and simply retraining won't help.


XML Schema Integration

XML Schema Integration is a conceptual term that refers to Grooper's ability to interact with XML schemas and the configuration required to do so.


Export Definition

Export Definitions is a property of Export Behaviors as defined on Content Types or Export Activities. It defines export connectivity to external systems such as file systems, content management repositories, databases, mail servers, etc.


CMIS Export

CMIS Export is an Export Definition available when configuring an Export Behavior. It exports content over a CMIS Connection, allowing users to export documents and their metadata to various on-premise and cloud-based storage platforms.


Data Export

Data Export is an Export Definition available when configuring an Export Behavior. It exports extracted document data over a Data Connection, allowing users to export data to a Microsoft SQL Server or ODBC-compliant database.


Extractor Type


Detect Signature


Field Match


Find Barcode


GPT Complete


Highlight Zone


Label Match


Labeled OMR


Labeled Value


List Match


Ordered OMR


Pattern Match


Query HTML


Read Barcode


Read Meta Data


Read Zone


Reference


Word Match


Zonal OMR


IP Command


Barcode Detection


Binarize


Extract Page


Line Removal


Scratch Removal


Shape Detection


Shape Removal


Import Provider


CMIS Import


Import Descendants


Import Query Results


Lookup


CMIS Lookup


Database Lookup


Web Service Lookup


Object


Batch


Batch Folder


Batch Page


Batch Process


CMIS Connection


CMIS Repository


Content Category


Content Model


Data Connection


Data Field


Data Model


Data Rule


Data Section


Data Table


Data Type


Document Type


Field Class


File Store


Form Type


IP Profile


Lexicon


Machine


OCR Profile


Object Library


Page Type


Processing Queue


Project


Review Queue


Scanner Profile


Separation Profile


Value Reader


Property


Confidence Multiplier and Output Confidence


Constrained Wrap


Content Type Filter


OCR Engine


Output Extractor Key


Paragraph Marking


Permission Sets


Scope


Secondary Types


Tab Marking


Vertical Wrap


Section Extract Method


Nested Table


Transaction Detection


Separation Provider


Change in Value Separation


Control Sheet Separation


EPI Separation


ESP Auto Separation


Event-Based Separation


Multi Separator


Pattern-Based Separation


Undo Separation


Service


API Services


Activity Processing


Grooper Licensing


Table Extract Method


Delimited Extract


Fluid Layout


Grid Layout


Row Match


Tabular Layout


UI Element


Document Viewer


Node Tree


Overrides


Summary Tabs