For example, OCR data is obtained from pages via the Recognize Activity. Processed documents are exported to a storage platform via the Document Export Activity.
Activities can be either "Attended" or "Unattended".
- Unattended Activities are automated and do not require human interaction. Classify, for example, is an unattended activity where documents are classified according to a Content Model.
- Attended Activities require human interaction. Classify Review, for example, is an attended activity where a user can verify Grooper's document classification performed during the Classify Activity.
Batches are the fundamental units of encapsulating documents in Grooper. They are a hierarchy of folders and pages used to represent documents and process them.
Batch objects in Grooper house two child objects.
- The root Batch Folder, containing a hierarchy of Batch Folders and Batch Pages.
- A read-only Batch Process, containing the list of processing instructions for the Batch Folders and Batch Pages
Batch Folders are part of the organizational hierarchy of Batches used to process documents in Grooper. They represent both "folders" as well as "documents" in a Batch.
First and foremost, Batch Folders are used to encapsulate and represent individual documents. A "document", from Grooper's perspective, is a folder with pages inside. It is a Batch Folder object with child Batch Page objects, one for each page of the document.
Secondarily, Batch Folders are used to create a folder hierarchy in a Batch, simply acting as folders containing other Batch Folders.
To avoid confusion, Batch Folders acting as documents are often referred to as "document folders".
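The folder-and-page hierarchy described above can be pictured with a small sketch. The class and function names here (`BatchPage`, `BatchFolder`, `document_folders`) are hypothetical stand-ins for illustration, not Grooper's object model or API. The sketch assumes a folder counts as a "document folder" whenever it has child pages.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BatchPage:
    """One page of a document (hypothetical stand-in for a Batch Page)."""
    name: str


@dataclass
class BatchFolder:
    """A folder that may hold pages, other folders, or both."""
    name: str
    pages: List[BatchPage] = field(default_factory=list)
    folders: List["BatchFolder"] = field(default_factory=list)

    def is_document(self) -> bool:
        # Assumption for this sketch: a folder with child pages is a "document folder".
        return len(self.pages) > 0


def document_folders(folder: BatchFolder) -> List[BatchFolder]:
    """Walk the hierarchy depth-first and collect every folder acting as a document."""
    found = [folder] if folder.is_document() else []
    for child in folder.folders:
        found.extend(document_folders(child))
    return found


invoice = BatchFolder("Invoice 001", pages=[BatchPage("Page 1"), BatchPage("Page 2")])
check = BatchFolder("Check 001", pages=[BatchPage("Page 1")])
root = BatchFolder("Root", folders=[BatchFolder("Box A", folders=[invoice, check])])

print([doc.name for doc in document_folders(root)])
# ['Invoice 001', 'Check 001']
```

In this sketch, "Root" and "Box A" act purely as organizational folders, while the two folders containing pages act as documents.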
Batch Page objects are created in two ways:
- During scanning, where a new Batch Page is created for each single-page image imported.
- Through the Content Action Activity, where a multipage PDF or TIFF is split into individual pages.
Batch Processes are the sequential set of processing instructions given to a Batch.
Batch Processes are comprised of Batch Process Steps. Each step performs a different operation, or Activity, one after the other. Each step is assigned a single Activity. Different activities perform different document processing functions, such as obtaining OCR text, classifying documents, or extracting data fields.
Batch Processes are highly configurable and reusable by multiple Batches.
Batch Process Steps are the individual processing instructions in a Batch Process.
Each step is assigned an Activity, which is performed on the batch during said step. Different activities perform different document processing functions, such as obtaining OCR text, classifying documents, or extracting data fields.
Batch Process Steps have further configuration considerations, including:
- At what level (or Scope) in the Batch it occurs (Batch, Batch Folder or Batch Page)
- How much processing power the Activity uses (done by assigning a Thread Pool)
- Conditional logic to determine if tasks are applied to individual folders or pages (via .NET expressions)
Binarize converts a color or grayscale image to black and white for further processing.
Binarize is an IP Command that converts a color or grayscale document image into black and white. This can be used as a basic method of cleaning up a document, rendering it black and white to produce better OCR results. This is accomplished by "thresholding" the image, setting a value that determines which pixels should be "white" and which should be "black". There are four thresholding methods available to Grooper: Simple, Auto, Adaptive, and Dynamic.
Binarization is also considered an "IP Primitive" in that it is foundational to other image processing operations. Binarization is helpful (and sometimes required) to perform other IP Commands, such as Line Removal and Box Removal. These commands can pre-process the image with their own binarization settings before final document binarization.
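As a rough illustration of the thresholding idea, the sketch below applies a single fixed cut-off to grayscale pixel values. It shows the general technique only. It is not Grooper's implementation of any of its four methods, and the function name `binarize` and the default cut-off of 128 are assumptions chosen for the example.

```python
from typing import List


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map 0-255 grayscale pixels to 0 (black) or 255 (white).

    Pixels at or above the threshold become white; all others become black.
    The default of 128 is an arbitrary illustrative value.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


image = [
    [12, 200, 130],
    [127, 128, 255],
]
print(binarize(image))
# [[0, 255, 255], [0, 255, 255]]
```

A fixed cut-off like this works on evenly lit images. Methods that choose the threshold automatically or vary it across the page exist to handle uneven backgrounds.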
CMIS (Content Management Interoperability Services) is an open standard allowing different content management systems to share information over the Internet.
CMIS defines an abstraction layer for controlling diverse document management systems and repositories using web protocols. This allows different content management systems to connect to each other's repositories to import, access, and publish documents and their metadata.
CMIS+ provides standardized access to document content and metadata across a variety of external storage platforms. All content platforms are exposed to Grooper under a single framework, called CMIS+, as if they were traditional CMIS endpoints by using CMIS Bindings.
This provides Grooper the necessary infrastructure to import, access, and publish documents and their metadata to both CMIS and non-CMIS storage platforms, including on-premise and cloud based platforms.
A CMIS Binding provides connectivity logic for external storage platforms, allowing CMIS Connection objects to import and export content.
Grooper's CMIS+ infrastructure expands connectivity from traditional CMIS servers to a variety of on-premise and cloud-based storage platforms by exposing connections to these platforms as CMIS Bindings. Each individual CMIS Binding contains the settings and logic required to exchange documents between Grooper and each distinct platform.
CMIS Bindings are used when creating a CMIS Connection object. The first step to creating a CMIS Connection is to configure the Connection Type property. First, the user selects which CMIS Binding they want to use, selecting which storage platform they want to connect to. The second step is to enter the connection settings for that binding, such as login information for many bindings.
CMIS Connections allow Grooper to connect to external storage platforms for import and export operations.
Grooper is able to connect to a variety of storage platforms, from simple file systems (such as Windows native NTFS system), to email sources, to full-fledged content management systems. The CMIS Connection object allows you to configure connection settings, allowing Grooper to integrate import and export control. With these settings saved on a Grooper object, they are easily referenced by an Import Watcher service or Document Export activity for automated batch processing.
Which platform is connected is defined by the CMIS Connection's Connection Type property. Once the connection is made to a specific storage location, called a "repository", Grooper has direct access to import and export documents from and to folder locations in the repository. Furthermore, Grooper has import and export access to metadata available to the particular platform. For example, Grooper can import sender and receipt metadata from emails received from email servers, or map extracted data elements from documents to corresponding data element locations in a content management system.
CMIS Export is one of the Export Providers available to the Document Export activity. It exports content over a CMIS Connection, allowing users to export documents and their metadata to various on-premise and cloud-based storage platforms.
CMIS Exports can be either "Mapped" or "Unmapped".
- Unmapped Export is a simple export of files to folders. Metadata (such as data extracted from documents used to populate a Data Model) must be exported as a "buddy file", such as an XML or CSV file. This is appropriate for simple storage platforms such as the Windows NTFS file system.
- Mapped Export allows you to export files and their metadata to a repository that allows for metadata storage as well. Many content management systems allow for document storage as well as storing metadata in fields in the storage platform. This is done by pointing the extracted and available document metadata from Grooper to corresponding locations within the content management system. This is set up using the CMIS Content Type objects in the CMIS Repository object, mapping a connection between objects in a Content Model within Grooper (such as Data Fields in a Data Model) and their corresponding locations in the content management system (such as a column in a SharePoint site).
Documents are imported from CMIS Connections using either the Import Descendants or Import Query Results providers. These can be used in two ways:
- To perform manual "ad-hoc" imports when creating a new Batch in Grooper Dashboard or Grooper Design Studio.
- To perform automated, scheduled imports using one or more Import Watcher Grooper services.
Import Descendants will import all documents within a designated folder location of a CMIS Repository. Import Query Results allows you to use a query syntax similar to a SQL query (called a CMISQL query) to set conditions for import based on the item's available metadata, such as a document's name, file type, creation date, archive status, or other variables.
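To give a feel for what a metadata-based query of this kind looks like, the sketch below assembles a query string in the style of the CMIS standard's query language. The property names `cmis:name` and `cmis:creationDate` come from the CMIS standard. The helper `build_query` is hypothetical, and the exact syntax and properties Grooper accepts for a given binding may differ, so treat this as an assumption-laden illustration.

```python
def build_query(name_pattern: str, created_after: str) -> str:
    """Assemble a CMIS-style query string (illustrative only).

    Note: this sketch does no escaping of quotes in its inputs, so it is
    only suitable for trusted, hard-coded values.
    """
    return (
        "SELECT * FROM cmis:document "
        f"WHERE cmis:name LIKE '{name_pattern}' "
        f"AND cmis:creationDate > TIMESTAMP '{created_after}'"
    )


# Select PDF documents created after the start of 2023.
query = build_query("%.pdf", "2023-01-01T00:00:00.000Z")
print(query)
```

The `WHERE` clause plays the role described above: only items whose metadata satisfies the conditions would be selected for import.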
A CMIS Repository is a file storage location within a CMIS Connection. This could be as simple as folders in an NTFS (Windows file system) connection. This could be a storage location in a content management system, such as a cabinet in an ApplicationExtender connection.
Before the CMIS Repository can be used for input/output in Grooper, it must first be "imported." This will create a CMIS Repository child object in the Node Tree under the CMIS Connection object. Importing, in this sense, does not bring the contents of the storage location into Grooper. Rather, it links the repository to Grooper (Documents themselves are imported using a CMIS Import).
Once the CMIS Repository is imported, connections to document metadata may also be set up to map to and from Grooper, such as mapping extracted fields to the connected platform. This is done using the CMIS Content Type child objects of the CMIS Repository object.
As an OCR engine attempts to recognize pixel clusters as characters, it will assign a value to its decisions. This value represents how "confident" the engine is that the chosen character corresponds to what is on the image.
This is a percentage score from 1 to 100 indicating how sure the OCR engine is that the recognized character is correct.
As far as Grooper is concerned, a document is a Batch Folder object with Batch Page objects as its children. Before classification, the document (Batch Folder) is an unclassified or "blank" document. Grooper doesn't know what kind of document it is yet. Documents are classified by:
- Most often, the Classify activity using training data or rules set on a Content Model
- A Classify step will automate document classification in a Batch Process. During the Classify activity, Grooper will use information from the document and its pages (generally text) and configurations from a Content Model (such as the Classification Method used) to assign the document a Document Type from a Content Model.
- In some cases, the Separate activity by assigning a Document Type to each new folder created
- Manually assigning a Document Type by right-clicking a Batch Folder and using the "Apply Document Type" command.
Collation Providers allow Data Type extractor results to be combined, organized, or utilized in specific ways.
Results can be combined, organized into arrays, returned as a key-value pair's value, and more.
The following Collation Providers are available in Grooper:
- Key-Value Pair
- Key-Value List
- Ordered Array
This can be set to one of three modes:
- Multiple - Multiple instances can run concurrently. Multiple occurrences of the Activities using the Thread Pool can run at the same time
- PerMachine - Only a single instance can run on a single machine. Only one occurrence of an activity can run on a machine at a time.
- Single - Only a single instance can run per Grooper Repository. Regardless of how many machines are connected to the repository attempting to run an activity, only one occurrence of an activity using the Thread Pool can run at a time.
Content Categories are one of the Content Type objects of a Content Model. Content Categories are used to organize Document Types and create hierarchy levels in a Content Model for hierarchical data modeling.
A more complicated set of documents may exist, where some Document Types are a sub-category of a larger Document Type. This could just be useful for logically organizing Document Types into categories. Furthermore, this is part of creating a hierarchical structure in a Content Model. The Content Category, as a child of the parent Content Model, will inherit the Content Model's Data Model and its Data Elements. Document Types, as children of the Content Category, will inherit the parent Content Category's Data Model (including the Content Model's Data Model that it inherited).
For example, if a set of documents contains Invoices, Checks, Receipts and Purchase Orders, they might fit into a Content Category of "Payments" and other Document Types might be placed under other Content Categories. All of the Document Types in the "Payments" Content Category would inherit from both the Content Category and the Content Model itself.
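The inheritance chain described above can be sketched as a simple parent-linked structure. The class `ContentType`, the model name "Accounting Model", and the field names are all hypothetical and chosen only to mirror the "Payments" example. This is not how Grooper stores Data Models internally.

```python
from typing import List, Optional


class ContentType:
    """Hypothetical stand-in for a Content Model, Content Category, or Document Type."""

    def __init__(self, name: str, fields: List[str],
                 parent: Optional["ContentType"] = None):
        self.name = name
        self.fields = fields    # Data Elements defined directly on this level
        self.parent = parent    # the level this one inherits from, if any

    def inherited_fields(self) -> List[str]:
        """Return fields from the root down to this node, parents first."""
        inherited = self.parent.inherited_fields() if self.parent else []
        return inherited + self.fields


model = ContentType("Accounting Model", ["Document Date"])
payments = ContentType("Payments", ["Total Amount"], parent=model)
invoice = ContentType("Invoice", ["Invoice Number"], parent=payments)

print(invoice.inherited_fields())
# ['Document Date', 'Total Amount', 'Invoice Number']
```

A field defined once at the model or category level becomes available to every Document Type beneath it, which is the practical benefit of the hierarchy.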
A Content Model is the digital representation in Grooper of a document set's content. What content you want to glean from your documents is all set up within a Content Model, including the system for classifying documents and what data you want to extract from them.
- Document Classification
- Data Extraction
Content Models define the classification taxonomy for a set of documents. This means a list of distinct types of documents (via Document Types) and their hierarchical structure within the Content Model (via optional Content Categories). How a document is classified is defined here as well (via the Classification Method and the Document Types).
Hand-in-hand with the classification taxonomy, Content Models also define the hierarchical data structure for the documents and document set (via Data Models of the various Content Types in the Content Model). The Data Models and their Data Elements define what data is extracted from documents and how that is accomplished.
Content Types are the building blocks of a Content Model. They are built to represent the content of a document set, both in terms of document classification and their data.
There are five Content Types:
The different Content Types are used to create a hierarchical structure within a Content Model, with each Content Type forming a different level of the document set's classification taxonomy. The Content Model forms the root of this hierarchy.
Control Sheets are printable pages used to automate various batch processing activities, such as separating documents. These pages are printed and placed at the beginning of a new batch or document in a stack of loose pages. They contain specialized barcodes called "patch codes", which Grooper can read to process their instructions.
Some of the actions Control Sheets can automate include:
- Setting configurations the Control Sheet Separation Separation Provider can read to separate scanned pages into documents.
- Control Sheets can also be configured to classify documents during this separation as well.
- Directing pages to an Image Processing activity, using an IP Profile assigned to the Control Sheet
- Indicating Batch creation, including assigning a Batch Process.
The order of the Data Columns in the Node Tree often reflects the column order of a table on a document. The top Data Column will correspond to the leftmost column. The bottom Data Column will correspond to the rightmost. However, this order is not required to correspond with the order on the document. This is also where you set the Header Extractor to find column headings if using the Header-Value Table Extraction method.
Data Connections are created to connect a Grooper Repository to an external database. The Data Connection stores all information needed to access the database. Once connected, Grooper has both read and write access to tables in the database (assuming you have those user rights in the database).
Grooper can connect to a Microsoft SQL Server or any ODBC-compliant provider. External database table information can be accessed in Grooper by importing a table reference. This will create a Database Table child under the Data Connection in the Node Tree.
Data Connections can be used to look up values in an external database table. They are also used to export extracted data from Grooper Data Elements to an existing database table. Furthermore, a new table in a connected database can be created from Grooper using Data Elements from a Data Model.
Data without context is meaningless. Context informs both what a piece of data refers to, as well as strategies to locate and return that data using extractors in Grooper.
Context is critical to understanding and modeling the relationships between pieces of information on a document. Without context, it’s impossible to distinguish one data element from another. Context helps us understand what data refers to or “means”.
There are three fundamental data context relationships:
- Syntactic - Context given by the syntax of data.
- Semantic - Context given by the lexical content associated with the data.
- Spatial - Context given by where the data exists on the page, in relationship with other data.
Using the context these relationships provide allows us to understand how to target data with extractors.
Data Elements are child objects of a Data Model. These objects are used to populate the Data Model with the data you wish to obtain from a document, such as field values on a form. They represent portions of the data extracted during the Extract activity.
Data Elements include:
Data Extractors are Grooper objects or property configurations used to isolate and return information from text data on a page.
Data extractors (or simply "extractors") are used in a variety of ways, including (but not limited to):
- Classify documents
- Find data on a page you wish to store outside of Grooper
- Separate documents
Data Fields are Data Elements added to a Data Model. Data Fields are the building blocks of a Data Model. A Data Field is a representation of a single piece of data targeted for extraction on a document.
For example, if you wanted to extract an invoice number from a collection of invoices, you would create an "Invoice Number" Data Field within a Data Model as its child. The extractor used to find that value on the document would be set on the Data Field, as well as other properties such as its ultimate format on output, how that field behaves during the Extract and Data Review activities (such as if the Data Field is required to be filled), and any .NET code based expressions used to validate or generate the field.
Data Formats are Data Extractors created as children of Data Type extractors. They are relatively simple extractors using a "Pattern Editor" to write a regular expression pattern to return a list of matching results.
Using the Pattern property of a Data Type, only a single regular expression pattern (and its property configuration) can be written. Data Formats can be leveraged to write multiple patterns. The Data Type will inherit the results each Data Format child returns. How the Data Type uses those results is determined by the Collation Provider assigned to its Collation property.
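The idea of several child patterns feeding one combined result list can be sketched with ordinary regular expressions. The two date patterns and the helper `extract_all` below are hypothetical examples. They use Python's `re` module rather than Grooper's Pattern Editor, and returning matches in reading order is just one assumed way of combining results.

```python
import re
from typing import List

# Hypothetical date formats, each playing the role of one child Data Format.
DATE_FORMATS = [
    r"\b\d{2}/\d{2}/\d{4}\b",   # e.g. 03/01/2024
    r"\b\d{4}-\d{2}-\d{2}\b",   # e.g. 2024-01-31
]


def extract_all(text: str, patterns: List[str]) -> List[str]:
    """Run every pattern and return matches in the order they appear in the text."""
    hits = []
    for pattern in patterns:
        hits.extend((m.start(), m.group()) for m in re.finditer(pattern, text))
    # Sort by starting position so results read in document order.
    return [value for _, value in sorted(hits)]


text = "Issued 2024-01-31, due 03/01/2024."
print(extract_all(text, DATE_FORMATS))
# ['2024-01-31', '03/01/2024']
```

Neither pattern alone finds both dates. The parent collects what each child returns, which is the role the Data Type plays for its Data Formats.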
Data Review is an Attended Activity, allowing users to verify the results of an Extract activity.
The Data Review module is used as manual validation of Grooper's automated data extraction. Users can review each document and their extracted fields according to how they are set up in a Content Model's Data Model. If the extracted data does not match the information on the page, the user is able to manually enter the correct information.
Data Sections are Data Elements of a Data Model. They allow a document's content to be subdivided into smaller portions (or "sections") for further processing, making the extraction process more efficient and accurate.
Often, they are used to extract repeating sections of a document. For example, if a document had several sections of data for different customers, a Data Section could be used to pull data for each customer. This is especially useful for situations where the data within the section is predictable, but the number of sections in the document is not (i.e. if one document has one customer's data listed but the next has five, the next has two, and so on and so on).
Data Sections can also be used to:
- Organize data from complex documents
- Make a hierarchical representation of a document's structure, or
- Reorder content from multiple columns on a page.
Data Sections may have, as their children:
Data is extracted using one of the following Table Extraction methods:
- Row Match - Using an extractor to match each row.
- Header-Value - Header values to find data for each column
- Infer Grid - Creating a grid based on header and row values
- See #Feature Tagging
The matching pattern (using the Data Type's Pattern property) or patterns (using child Data Formats or Data Types) will return as a list of values. The returned values can be further collated, isolated, and manipulated by configuring the properties of the Data Type. Data Types have a variety of uses in Grooper. Not only are they used to populate Data Fields in a Data Model with extracted text, but can be used to separate pages into document folders, classify documents, and more.
Database Export is an Activity used to export extracted data to a SQL or ODBC compliant database.
Database Lookups are used to populate or validate fields in a Data Model using an external database, such as a SQL database.
For validation, values in a Data Field are compared to corresponding values in a database table. If the results differ from what’s in the table, that field can be flagged.
For data population, as long as the "lookup values" exist in the external database, Grooper extracted values can be used to populate additional Data Fields. Given a certain field or fields in a Data Model matches fields in a database table, additional values in the table’s row can be assigned to empty Grooper Data Fields. For example, an extracted social security number from a document could be used to look up a corresponding name in a database that has both the social security number and name.
The Grooper data fields used for comparison against a database table are called "lookup fields". The fields populated from the database table are called "target fields". A SQL query is used to find the lookup fields and target fields in the database table.
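The lookup-field and target-field relationship can be sketched with an in-memory SQLite database. The table name `employees`, its columns, the sample row, and the helper `lookup_name` are all made up for illustration. Grooper's own lookup configuration is done through its property grid, not through code like this.

```python
import sqlite3

# Build a throwaway database standing in for the external table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (ssn TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO employees VALUES ('123-45-6789', 'Jane Doe')")


def lookup_name(connection: sqlite3.Connection, ssn: str):
    """Use the lookup field (ssn) to fetch the target field (name), or None."""
    row = connection.execute(
        "SELECT name FROM employees WHERE ssn = ?", (ssn,)
    ).fetchone()
    return row[0] if row else None


# "SSN" was extracted from the document; "Name" is empty and gets populated.
extracted = {"SSN": "123-45-6789", "Name": None}
extracted["Name"] = lookup_name(conn, extracted["SSN"])
print(extracted)
# {'SSN': '123-45-6789', 'Name': 'Jane Doe'}
```

The same query could serve validation instead of population: if the document already had a name, comparing it with the returned value would reveal a mismatch worth flagging.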
A Database Table, in Grooper, is an object representing an external database table Grooper has connected to using a Data Connection.
External databases can then be queried from Grooper, accessing their data to populate or validate fields in a Data Model using Database Lookups during the Extract activity. Extracted data can also be exported to an external database via the Database Export activity. This is done by creating mappings between the external database's table columns (using the Database Table object) and fields in a Data Model.
If a set contains Invoices, Checks, Receipts and Purchase Orders, there might be four Document Types, one for each kind of document. Classification, in Grooper, is the process of assigning Document Types to Batch Folders in a Batch.
The Grooper Document Viewer is the portal to your documents.
Understanding the Document Viewer means knowing the functionality of its main toolset.
Document Viewer Toolbar
The Document Viewer Toolbar runs along the top of the pane on the right of the screen, just above the document you are viewing. (Some buttons do not apply to certain file types and will be hidden in those contexts.)
This lets the user re-position the document in the viewer typically as a result of it being at some level of zoom.
This lets users make a rectangular zoom selection and fits that selection to the display window.
This lets users click & drag to see a magnified overlay window, great for viewing character or image quality details close-up.
Transform and File System Tools
These tools allow the user to rotate the document by 90° in either a clockwise or counter-clockwise direction.
Print and Save Page Tools
These tools allow a user to either print or save (as a PDF, among other image formats) the currently displayed page.
This drop-down lets the user select which rendition of the file to display. For OCR’ed documents, you can choose between the image or the OCR text. For PDF documents that have been Recognized, you can view the PDF in its Native Format or its Character Data renditions. XML-based Word and Excel documents (.docx and .xlsx files from Office 2007 onward) can be viewed as their component XML files, where users can see versioning, formatting, style, application, workbook, sheet, and other information within the document.
Page View Tools
Prior to Grooper 2.9, the Rendition Selector and most of the image adjustment tools on pages did not exist.
Export Providers are used by the Document Export activity to define where and how Grooper content is exported.
Which Export Provider you choose will determine where your processed content in Grooper (including documents and their metadata) is exported. They allow you to define the connection type when exporting outside of Grooper, such as file systems, content management repositories, or mail servers.
The Legacy Export providers only exist to provide backwards compatibility. Their storage platform endpoints have CMIS Binding analogues suitable for CMIS Export. For example, the legacy Mail Export provider has a corresponding IMAP binding, allowing CMIS Export to export to IMAP mail servers. CMIS Export should be prioritized over Legacy Export going forward.
"Feature Tagging" (also known as "Data Tagging") is an informal term for the use of the Output Extractor Key property of the Individual Collation Provider. Enabling this property normalizes an extractor's result to a named token or "tag". This has practical applications to normalize data as features for document Classification or a Field Class's Feature Extractor.
For example, you may be able to classify a document based on the fact that it contains several dates. The individual dates don't matter, but the fact there are several dates on the document may be a way of distinguishing it from others. The Output Extractor Key property would normalize each individual result to a single named token or "tag". For example, if a Data Type named "Date" returns 50 individual dates, each result would be transformed into the phrase "feature_date", resulting in a list of 50 returns of the phrase "feature_date".
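The normalization step can be sketched in a few lines: every match is replaced by the same tag, so only the count of matches survives. The helper `tag_features`, the date pattern, and the sample text are hypothetical and stand in for the extractor and the Output Extractor Key behavior described above.

```python
import re
from typing import List


def tag_features(text: str, pattern: str, tag: str) -> List[str]:
    """Return one copy of `tag` for every match of `pattern` in `text`."""
    return [tag for _ in re.finditer(pattern, text)]


text = "Opened 01/02/2020, renewed 03/04/2021, closed 05/06/2022."
tags = tag_features(text, r"\b\d{2}/\d{2}/\d{4}\b", "feature_date")
print(tags)
# ['feature_date', 'feature_date', 'feature_date']
```

Because the specific dates are discarded, two documents with entirely different dates produce the same feature, which is what makes the tag useful for classification.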
File Stores are storage locations for content files associated with nodes in Grooper's Node Tree. This includes the scanned and imported images of Batch Page objects, the PDF versions of imported PDFs on Batch Folder objects, OCR results for Batch Pages, and scripting source code.
If you select an object in the Node Tree, then go to the "Advanced" tab, then go to the "Files" tab underneath, each file listed is stored in a File Store.
File Stores can be any folder you have writable access to, but it is best practice to use a fully qualified UNC path.
Flow collation methods allow Data Type extractors to parse data using the "flow" of text within a document. Text "flows" in its read direction. For the English language, text is read (or flows) from the top of the page to the bottom, and from left to right.
This is particularly useful when processing natural language. The Flow property is available to the following Collation Providers:
Each document trained will create (or supplement) a child Form Type of its assigned Document Type. This is where all the training weightings for the extracted text features reside for Lexical document classification. Each page of the document is represented by Page Types, which are its children.
Fuzzy RegEx (also referred to as "fuzzy matching" or "fuzzy mode" or even just "fuzzy") allows regular expression patterns to match text within a set percentage of similarity. This can allow Grooper users to overcome unpredictable OCR errors when extracting data from documents.
Typically, a regular expression will either match a string of text or it won't. If you're trying to match a word and the regex pattern is even a single character off from the text data, no result is returned.
Fuzzy RegEx uses a Levenshtein distance equation to measure the difference between the regular expression and potential text matches. The similarity between the regex pattern and the matched text is expressed as a "confidence score" (as a percentage). If the confidence is above a set threshold, the result is returned. If it is below the threshold, it is discarded.
For example, a text string that is 95% similar to the regex pattern may be off by just a single character. If the Minimum Similarity threshold is set to 90% the result would be returned, even though the pattern doesn't match the text exactly.
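The 95% figure can be reproduced with a small worked sketch. This version compares two literal strings rather than a regular expression against text, and it normalizes the edit distance by the longer string's length. Both are simplifying assumptions, and Grooper's actual scoring may be computed differently. The function names and sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(expected: str, candidate: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    longest = max(len(expected), len(candidate))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(expected, candidate) / longest


expected = "Total Invoice Amount"   # 20 characters
ocr_text = "Total lnvoice Amount"   # OCR read the capital "I" as a lowercase "l"
score = similarity(expected, ocr_text)

print(round(score, 2))   # 0.95
print(score >= 0.90)     # True
```

One wrong character out of twenty gives a distance of 1 and a similarity of 95%, so the candidate clears a 90% minimum and would be kept.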
The Grooper Attended Client is part of the Grooper Product Suite. This program executes Attended Activities. These are Activities in Grooper that require human interaction, used to validate results from Grooper's automated Unattended Activities.
Data Review, for example, is an Attended Activity where a user can verify the data Grooper extracted from a document during the Extract activity actually matches up with what's on the document.
Grooper Config is part of the Grooper Product Suite. This program is used to configure settings specific to a server or workstation (whereas Grooper Design Studio is used to architect objects such as Content Models and Batch Processes used by all servers and workstations).
Grooper Config has two main purposes:
- Setting up new and connecting to existing Grooper Repositories.
- Setting up and configuring Grooper Services.
Grooper Dashboard is part of the Grooper Product Suite. Grooper Dashboard is a program for end-users to process documents, providing Batch management and reporting capabilities for production level Batches.
The instructions for how documents are processed (Batch Processes) created in Grooper Design Studio can be executed and managed in Grooper Dashboard. Batches can be created and assigned a Batch Process in Grooper Dashboard. Users can audit flagged documents and Batch Process Steps in an error state. Users can pause tasks in process. Grooper Dashboard is typically how users manually validate documents to ensure automated tasks performed correctly, using Attended Activities such as Classify Review or Data Review.
Grooper Design Studio is part of the Grooper Product Suite. Grooper Design Studio is an administrator level program for creating and configuring objects controlling how documents are processed.
Not only can it run these Reports, but it can run them in real time, giving users the ability to see Batch processing statistics as documents are processed from end to end. Grooper Kiosk is designed to run in full screen mode on a wall mounted monitor, giving all production level users in a document processing center insights from batch analytics.
Grooper Licensing is a Grooper Service that distributes licenses to workstations running Grooper applications. Except for stand-alone Grooper installs, a Grooper Licensing service is required for all deployment scenarios.
For client-server environments, a Grooper Licensing service is installed and started on the host server using Grooper Config. Then, a License Server object is created in Grooper Design Studio and configured with the host server's name and port number. Next, the license is activated using the "Activate Online" button and entering the Grooper serial number, or, if necessary, imported from a .lic file using the "Import License Package" button.
Client machines can then be set up to access licensing information by creating their own License Server object, referencing the host server's name and port number (Note: You will also need to add the server's machine id to the Machines folder to accomplish this).
There are six main nodes in the Node Tree. These nodes serve the purpose of organizing Grooper objects to assist Node Tree navigation.
- Batch Processing Folder
- Content Models Folder
- Data Extraction Folder
- Global Resources Folder
- Infrastructure Folder
- Reports Folder
Every Grooper Repository will contain these nodes and their base child node contents. All other objects created, including (but not limited to) Batches, Content Models, and Data Extractors, will be created as children of one of these six nodes (or folder nodes within these nodes).
The Grooper Product Suite includes all programs that come installed with Grooper.
- Grooper Design Studio
- Grooper Dashboard
- Grooper Config
- Grooper Attended Client
- Grooper Unattended Client
- Grooper Kiosk
A Grooper Repository is the environment used to create, configure and execute objects in Grooper. It provides the framework to "do work" in Grooper.
This environment consists of a database connection and a file store location. The database stores Grooper nodes and their property settings (such as a Content Model or a Data Type or any other Grooper object). The file store location houses content associated with these nodes (such as the image file for a Batch Page object).
Creating (or connecting to) a Grooper Repository is one of the first things you will do after installing Grooper to start designing (or implementing already architected) document processing solutions. For more information on how to create or connect to a Grooper Repository, visit the Install and Setup article.
The Grooper Root is the top level (or Root Node) of the Node Tree in Grooper Design Studio. It contains all Grooper Node objects as its children. It also stores several settings that apply to the connected Grooper Repository, including the Active File Store location and License Server.
Informally, "Root Node" and "Grooper Root" are used interchangeably.
Grooper Services are various executable applications that run as a Windows Service to aid Grooper. Service Instances are installed, configured, started and stopped using the "Services" tab of Grooper Config.
For example, Import Watcher is a service that watches an assigned external storage location, like a Windows folder or an email inbox, and will automatically import its contents into Grooper for automated Batch creation.
Grooper Config must be run as an administrator to install and configure Grooper Services.
The Grooper Unattended Client is part of the Grooper Product Suite. This program executes Unattended Activities. These are Activities in Grooper that are automated.
Header-Value is one of three methods available to Data Table elements to extract information from tables on a document set. It uses a combination of column header and column value extractors to determine the table’s structure and extract information from the table’s cells.
These operations generally serve one of three purposes:
- Image cleanup to improve the exported archival image
- Image cleanup to improve OCR results
- Image-based data collection, including Layout Data (such as table line locations, barcode information, and OMR checkbox states) as well as image features used for Visual classification.
IP Commands are permanently applied to a document when an IP Profile is executed during the Image Processing activity. This affects the archival result of the document's pages. IP Commands are temporarily applied to a document prior to OCR when an IP Profile is executed during the Recognize activity. This is useful for non-destructive image cleanup to improve OCR results, keeping the document's pages as their original image to preserve their archival images upon export.
IP Groups are used to:
- Organize complex IP Profiles
- Create re-usable units of image processing which can be copied and pasted into other IP Profiles
- Conditionally execute or skip a sequence of IP Steps.
An IP Profile is a sequence of instructions for image processing. It is composed of IP Steps and IP Groups (which are themselves collections of IP Steps). Each IP Step contains an IP Command, which defines an image processing operation.
These operations generally fall into three categories:
- Archival Adjustments - These are permanent adjustments to the exported document's image.
- Permanent image adjustments are performed when an IP Profile is executed during the Image Processing activity.
- OCR Cleanup - Image cleanup can dramatically improve OCR results.
- However, they can also drastically alter the document's image. Image adjustments are temporarily applied to a document prior to OCR when an IP Profile is executed during the Recognize activity. This is useful for non-destructive image clean up to improve OCR results, keeping the document's pages as their original image to preserve their archival images upon export.
- Layout Data Collection - This includes visual information used for data extraction purposes (such as table line locations, barcode information, OMR checkbox states) as well as image features used for Visual classification.
- Layout Data can be collected either during the Image Processing or the Recognize activities.
IP Steps comprise the instructions listed in an IP Profile to perform image processing operations. IP Profiles are a sequential list of IP Steps. One IP Command is assigned and configured per IP Step.
Import Providers enable you to import files into Grooper from a variety of sources, such as file systems, mail servers, and other on-premise and cloud based document storage platforms.
Import Providers can be used in two ways:
- To perform manual "ad-hoc" imports when creating a new Batch in Grooper Dashboard or Grooper Design Studio.
- To perform automatic, scheduled imports using one or more Import Watcher services.
The Import Mode property of an Import Provider allows you to control whether a document's content (i.e. the images and, in the case of PDF documents, their text), its properties, and a link between the document's source location and Grooper are created.
There are three Import Modes in Grooper:
- Full - This mode fully imports the document. Both its content and its properties are loaded into a Grooper Batch. Because the files are fully copied from the source into a Grooper environment, this is the slowest of the three import modes.
- Sparse - The Sparse import mode imports a document's properties as it does in Full mode. However, instead of fully importing the document's content, a link between Grooper and the content at the import source is created. Particularly when importing large document sets, this can greatly reduce the time it takes to import documents. If needed, the content can also be loaded in parallel using the Execute activity.
- LinkOnly - This mode only creates a link between the document and the source.
Infer Grid is one of three Table Extraction methods to extract data from tables on documents. It uses the positional location of row and column headers to interpret where a tabular grid would be around each value in a table and extract values from each cell in the interpreted grid.
This method extracts information by inferring a grid from the row and column header positions. This is done by assigning an X Axis Extractor to match the column headers and a Y Axis Extractor to match the row headers. A grid is created from the header positions returned by the two extractors.
Furthermore, if table line positions can be obtained from a Line Detection or Line Removal IP Command, only the X Axis Extractor is needed. In these cases, the X Axis Extractor can be used to find the column header labels, and the grid will be created using the table lines in the document's Layout Data. The raw text data obtained from the Recognize activity will populate each cell of the grid according to where it is on the page.
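As a rough illustration of the idea (not Grooper's implementation), the sketch below assumes the two extractors have already been reduced to a list of column header x positions and row header y positions, and that each word of recognized text carries a page position. The function name `infer_grid` and all coordinates are hypothetical.

```python
from bisect import bisect_right

def infer_grid(col_starts, row_starts, words):
    """Assign each (x, y, text) word to the cell whose column and row contain it."""
    cols = sorted(col_starts)
    rows = sorted(row_starts)
    grid = {}
    for x, y, text in words:
        c = bisect_right(cols, x) - 1  # index of the last column starting at or before x
        r = bisect_right(rows, y) - 1  # index of the last row starting at or before y
        if c < 0 or r < 0:
            continue  # left of the first column or above the first row
        grid.setdefault((r, c), []).append(text)
    return {cell: " ".join(parts) for cell, parts in grid.items()}

col_starts = [100, 300]  # hypothetical x positions of column headers
row_starts = [50, 90]    # hypothetical y positions of row headers
words = [(110, 55, "Widget"), (310, 55, "4.00"),
         (105, 95, "Gadget"), (320, 95, "7.50"),
         (10, 55, "ignored")]
cells = infer_grid(col_starts, row_starts, words)
```

Each key in `cells` is a (row, column) index pair. The word at x = 10 falls outside the inferred grid and is skipped.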
The parent node determines what type of object can be created as a child node at the next level in the hierarchy underneath it. The parent also determines what information the child has access to and which traits, if any, it inherits from its parent. Certain objects can also pass information as children to their parents.
Layout Data is information such as line locations, OMR checkbox locations and states, barcode values, and detected shapes captured by certain Image Processing commands. This data can later be recalled by various functions within Grooper that rely on the presence of that data to function.
The Lexical Classification Method is one of three methods of classifying documents available to Grooper. This method classifies documents according to their text content, obtained from OCR or extracted native PDF text (via the Recognize activity). It uses a Training-Based Approach to "teach" Grooper to classify a document from trained examples of the Document Type.
A lexicon is a list of words, phrases, or other information. Grooper Lexicons are also text-based lists referenced in various ways by other objects.
Lexicons can be used to:
- Look up values during data extraction
- For example, an extractor could be set up to return first or last names from a Lexicon of common first or last names.
- Translate extracted values from one value to another
- For example, an extractor could be set up to look up the full name of a company (ACME Document Corporation) in a Lexicon and translate the result to an abbreviated version (ADC).
- Assign weighting values for fuzzy matching
- Determine the frequency of values within a document set
- and more.
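The first two uses can be pictured with ordinary Python collections. This is only a sketch: the set and dictionary below stand in for Lexicon objects, and the helper names `lookup_names` and `translate` are hypothetical.

```python
# Hypothetical lexicons: plain Python collections standing in for Grooper Lexicons.
first_names = {"alice", "bob", "carol"}
company_abbreviations = {"ACME Document Corporation": "ADC"}

def lookup_names(text, lexicon):
    """Return the words in the text that appear in the lexicon (case-insensitive)."""
    words = [w.strip(".,") for w in text.split()]
    return [w for w in words if w.lower() in lexicon]

def translate(value, lexicon):
    """Translate a value through the lexicon, leaving unknown values unchanged."""
    return lexicon.get(value, value)

found = lookup_names("Signed by Alice and Bob.", first_names)
short = translate("ACME Document Corporation", company_abbreviations)
```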
Before configuring a License Server object, at least one Grooper Licensing service needs to be created and running. This can be accomplished using the "Services" tab of the Grooper Config program. The license information obtained by the License Server is used by Grooper Activities that need to check out licenses. While it is possible to add multiple License Server objects to the License Servers folder in the Node Tree, running Grooper activities will use only the License Server set on the License Server property of the Grooper Root.
The following objects can be created in the Local Resources Folder:
Local Resources Folders can be created as children of the following Content Types:
Each Content Type object may house one Local Resource Folder as a child folder. However, one can make as many folders within the Local Resources Folder as desired.
Machine objects represent a server or workstation connecting to a Grooper environment.
Machines are automatically added to the Machines folder in the Node Tree when a connection to a Grooper Repository is established from Grooper Config. However, they can also be manually added and removed from the Machines Folder. For example, you may need to manually add a Machine in order to point to a server running a Grooper Licensing service to obtain licensing information.
Selecting a Machine displays a performance statistics panel, all Grooper Services installed from Grooper Config, and details about the server or workstation computer such as operating system and processing information.
Nodes pass information to and from each other through "inheritance". The parent node dictates inheritance. It controls what child objects can be created underneath it, as well as what information is passed to and from the parent and child nodes.
This is the hierarchical list of folders and other items (called objects) found in the left panel when you first open Grooper Design Studio. It is the basis for object navigation and creation in Grooper Design Studio.
Adding an object creates a new "node" in the Node Tree. It has a "tree structure" in that it "branches." Each new node added under another creates a new "branch". It is added one level down from the node under which it is created. Which objects can be added under a branch is determined by the branch above it.
OCR stands for Optical Character Recognition. It allows text from paper documents to be digitized, in order to be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine readable, encoded text.
This conversion allows Grooper to search text characters from the image, providing the capability to process these documents and the information they contain.
An OCR Engine is the part of OCR software that does the actual character recognition, analyzing the pixels on an image and figuring out what characters they represent. This raw result can be further processed using Grooper's OCR Synthesis capabilities, producing the final OCR result used by Data Extractors to match text in a document and return the result.
The Transym OCR 4 and Transym OCR 5 engines are included in Grooper. Transym OCR 4 provides highly accurate English-only OCR while Transym OCR 5 provides multi-language OCR for 28 different languages. Versions 2.72 and beyond also support Google's open source Tesseract OCR engine. The ABBYY FineReader, Prime OCR, and Azure OCR engines are also supported but require separate installations and separate licensing.
An OCR Profile defines the settings for performing OCR.
- Setting which OCR Engine is used
- Determining whether a temporary IP Profile is used for image cleanup before the OCR engine runs
- Grooper's unique Synthesis settings
- Determining if and how multiple OCR results are pre-processed and re-processed
- If and how results are filtered, to toss out undesirable results.
- Any configurable settings available from the OCR Engine
The Synthesis functionality is Grooper's unique method of pre-processing and re-processing raw results from the OCR engine to get better results out of it.
Using Synthesis, portions of the document can be OCR'd independently from the full text OCR. Portions of the image dropped out from the first OCR pass can be re-run through the OCR engine. And, certain results can be reprocessed. The results from the Synthesis operation are combined with (or in some cases replace) the full text OCR results from the OCR Engine into a single text flow.
For example, "folder object" refers to any folder where other objects can be stored in the Node Tree, regardless of where it exists in the Node Tree's hierarchy. A Content Model is a type of object. A Data Type is a type of object. A specific Data Type named "Date Extractor" designed to extract dates is a node in the Node Tree. Its node location in the Node Tree is important for determining its inheritance and referencing it. But, it will always be a Data Type object.
An Object Library is a .NET framework library containing custom objects used to customize the way Grooper works.
They can be used with Grooper's scripting functionality for a variety of tasks, including custom Activities, objects for customizing a Review client, custom services to automate work in Grooper's background, and custom reports.
A single page of a trained document is created automatically within Grooper as a child of a trained document's Form Type. If the trained document has five pages, the Form Type will have five Page Type children. They are used to store sample pages and their training data (the weightings of classification features).
Paragraph Marking alters the normal text data in a document by placing the carriage return and new line feed pairs at the end of each paragraph, instead of the end of each line. This allows users to break up a document's text flow into segments of paragraphs instead of segments of lines.
The Paragraph Marking property is enabled in the Preprocessing Options of the Pattern Editor window. There are several paragraph detection settings to determine what qualifies as a paragraph.
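The effect on the text flow can be sketched in a few lines. This is a simplification, not Grooper's detection logic: it assumes paragraphs are separated by blank lines, whereas Grooper's paragraph detection settings decide what qualifies. The function name `mark_paragraphs` is hypothetical.

```python
def mark_paragraphs(text):
    """Join the lines of each paragraph and end each paragraph with one CR/LF pair.

    Assumption: a blank line marks the end of a paragraph.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "".join(p + "\r\n" for p in paragraphs)

marked = mark_paragraphs("First line\r\nsecond line\r\n\r\nNew paragraph\r\n")
```

After marking, a pattern anchored to the start and end of a line spans a whole paragraph rather than a single printed line.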
The parent node defines inheritance for children objects created underneath. It controls what child objects can be created underneath it, as well as what information is passed to and from the parent and child nodes.
These are meant to be "real world" batches for actual end-to-end document processing.
The Recognize activity obtains a document's content, including text and Layout Data, and saves it for use by further processing activities, such as document classification and data extraction.
Text data is critically important for almost everything you do in Grooper. The Recognize activity obtains text data from both scanned (or otherwise image-based) documents and digital documents with native machine-readable text already embedded. For image-based documents, the activity will execute an OCR Profile, performing OCR according to its settings. For digital documents, the activity will extract the embedded text.
The Recognize Activity can:
- get machine readable text from images by performing OCR on a page or embedded images in a PDF,
- extract native PDF text directly
- extract Layout Data or
- any combination at the document folder or page level.
Regular expression (or "regex") is a standard syntax designed to parse text strings. It is a way of finding information in a block of text, and it is the primary method by which Grooper extracts and returns data from documents.
Using a standard syntax, a sequential line of characters is written to match a string of characters in the text. This line of characters written to match text is called a "pattern" and can potentially return multiple strings, not just one value. It will return any string of text matching the pattern.
This syntax can be used to match very specific strings of characters or written more generally to match several permutations of the pattern. For example, one can write a regular expression pattern to match a specific date or any date in a text block.
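The difference between a specific and a general pattern can be seen in a short example. It uses Python's `re` module purely as a stand-in regex engine, and the sample text is made up.

```python
import re

text = "Invoice dated 03/14/2021, due 04/13/2021. Ref 12-345."

# A very specific pattern: matches only this one date.
specific = re.findall(r"03/14/2021", text)

# A general pattern: matches any date written as two digits, two digits, four digits.
any_date = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", text)
```

The general pattern returns every matching string in the text, not just the first one.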
The Render activity normalizes electronic document content from file formats Grooper cannot read innately to a PDF format. This allows Grooper to extract the text via the Recognize activity.
- See #Report Instance
Reports display information collected from various Grooper operations.
Individual Reports are created as Report Instances in Grooper. What information the Report Instance displays is defined by the Report Type property. There are several "System Reports" that ship with every Grooper install. Many of these are designed to track batch processing automation rates or the productivity of data entry clerks. For example, the "Keystrokes" report tracks the number of keystrokes logged during Data Review for Batches using a specified Batch Process.
Generally, "repository" is a term meaning a location where files and/or data are stored and managed.
In Grooper, the term "repository" usually refers to one of two things:
In Grooper Design Studio, the root node is also called the Grooper Root and contains all Grooper node objects. It is the top level of the node tree hierarchy. It also stores several settings that apply to the connected Grooper Repository, including the Active File Store location and License Server.
Row Match is one of three Table Extraction methods available to Data Table Data Elements to extract information from tables on a document set. It uses regular expression pattern matching to determine a table's structure based on the pattern of each row and extract cell data from each column.
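A minimal sketch of the row-pattern idea follows, using Python's `re` module as a stand-in. The pattern, the column names (`qty`, `item`, `price`), and the sample text are all hypothetical; in Grooper the row pattern is configured on the Data Table rather than written as code.

```python
import re

# One pattern describes a whole row; named groups play the role of columns.
ROW = re.compile(
    r"^(?P<qty>\d+)[ \t]+(?P<item>[A-Za-z ]+?)[ \t]+(?P<price>\d+\.\d{2})$",
    re.MULTILINE,
)

text = """Qty Description Price
2 Blue Widget 4.00
10 Gadget 7.50
Total 11.50"""

# Every line that fits the row pattern becomes one table row.
rows = [m.groupdict() for m in ROW.finditer(text)]
```

The header and total lines do not fit the row pattern, so only the two line items are returned.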
The Rules-Based Classification Method is one of three methods of classifying documents available to Grooper. This approach uses Data Extractors to find key words, phrases, or other text-based information in order to identify and classify a document (assigning a Document Type to the Document Folder).
The Rules-Based method classifies documents according to "rules" using the Positive Extractor and Negative Extractor properties of Document Type objects in a Content Model.
If an extractor set as the Positive Extractor returns a result on a document, the document would be classified as that Document Type. The Negative Extractor works the opposite way. If this extractor finds a result on a document, it would be prevented from being classified as that Document Type. This type of classification can be useful if a document's structure is always predictable or has a fixed title heading or form number and OCR errors are not an issue.
Classification is then performed using the Classify activity, using the extraction rules established by the Positive and Negative Extractor properties of Document Types in a Content Model.
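The positive and negative rule logic can be sketched as follows. This is an illustration under assumptions, not the Classify activity itself: the Document Types, the patterns, and the first-match-wins ordering are all hypothetical.

```python
import re

# Hypothetical Document Types, each with a positive and an optional negative pattern.
DOCUMENT_TYPES = {
    "Invoice": {"positive": r"\bINVOICE\b", "negative": r"\bCREDIT MEMO\b"},
    "Purchase Order": {"positive": r"\bPURCHASE ORDER\b", "negative": None},
}

def classify(text):
    """Return the first Document Type whose positive rule hits and negative rule does not."""
    for name, rules in DOCUMENT_TYPES.items():
        if rules["negative"] and re.search(rules["negative"], text):
            continue  # a negative hit prevents this Document Type from being assigned
        if re.search(rules["positive"], text):
            return name
    return None  # no rule matched; the document stays unclassified
```

For example, `classify("INVOICE No. 1001")` returns "Invoice", while a credit memo that mentions an invoice number is blocked by the negative rule.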
A Scanner Profile defines connections to a document capture device (scanners etc) and settings which control how the device operates.
ISIS and TWAIN drivers are supported. The scanner's driver must be installed before Grooper can connect to it through a Scanner Profile. Optionally, you may attach an IP Profile to a Scanner Profile for custom image cleanup of images. If the scanner has its own image processing software, you should disable it to fully take advantage of Grooper's IP Profiles.
You may also create, edit and delete Scanner Profiles from the Scan activity step in a Batch Process if the Enable Profile Editing and Enable Profile Management properties are enabled.
Activities can run at different levels in a Batch.
They can run on the following:
Separation, in Grooper, is the process of turning loose pages into documents, by determining points in a Batch at which Batch Folders are created and subsequent Batch Pages are placed inside.
Pages are organized into document folders during the Separate activity. There are a variety of methods to separate pages into documents during this activity, including (but not limited to) the use of printed control sheets, defined page lengths, and extractible text content. The specific separation method is determined by the Separation Provider and its configuration used during the Separate activity. You may also save and re-use a Separation Provider's configuration settings by creating a Separation Profile.
There are five different Separation Events:
When the event is triggered, one of two things can happen:
- A new folder is created and pages are included until a new event is triggered which will result in a new document
- The page triggering the event may be deleted
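The two outcomes can be sketched with a simple loop. This is a simplified model, not a Separation Provider: pages are plain strings, and `separate`, `is_trigger`, and `drop_trigger_page` are hypothetical names.

```python
def separate(pages, is_trigger, drop_trigger_page=False):
    """Group loose pages into folders, starting a new folder at each triggering page."""
    folders = []
    for page in pages:
        if is_trigger(page):
            folders.append([])  # the event starts a new document folder
            if drop_trigger_page:
                continue        # e.g. a control sheet that should not be kept
        if not folders:
            folders.append([])  # pages before the first event still need a folder
        folders[-1].append(page)
    return folders

pages = ["SEPARATOR", "inv p1", "inv p2", "SEPARATOR", "po p1"]
docs = separate(pages, lambda p: p == "SEPARATOR", drop_trigger_page=True)
kept = separate(pages, lambda p: p == "SEPARATOR")
```

With `drop_trigger_page=True` the separator sheets are discarded. Without it, each separator becomes the first page of its folder.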
These settings determine how loose pages are separated and grouped into documents, placing them in Batch Folders within the Batch. For each profile, a Separation Provider is specified, defining the method by which pages are separated. The Separation Profile nodes also contain a "Testing" tab to verify results of the profile.
Separation Providers are the available methods Grooper has to separate pages into document folders.
Each provider has its own configurable properties. Changing these properties will change the criteria to separate pages into documents.
- See #Import Mode
- See #OCR Synthesis
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic intended to reflect how important a word is to a document within a collection (or document set or “corpus”). It is how Grooper uses machine learning for training-based document classification (via the Lexical method) and data extraction (via the Field Class extractor).
TF-IDF is designed to reflect how important a word is to a document or a set of documents. The importance value (or weighting) of a word increases proportionally to the number of times it appears in the document (Term Frequency). The weighting is offset by the number of documents in the set containing the word (Inverse Document Frequency). Some words appear more generally across documents and hence are less unique identifiers. So, their weighting is lessened.
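The weighting can be sketched with one common textbook variant of the formula, tf × log(N / df). Grooper's exact formula is not stated here, so treat this as an illustration only. The function name `tf_idf` and the sample documents are hypothetical, and documents are assumed to be non-empty.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

docs = ["invoice total due invoice", "purchase order total", "packing slip total"]
weights = tf_idf(docs)
```

Here "total" appears in every document, so its weight is zero. "invoice" appears twice in the first document and nowhere else, so it gets the highest weight there.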
Tab Marking allows you to insert tab characters into a document's text data.
The Tab Marking property enables tab characters for regular expression pattern matching. These characters are inserted into a document's text data wherever there is a large gap of space between characters on a line. Space characters are converted to tab characters based on the width of the gap between text characters.
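A rough approximation of the result is shown below. Grooper measures the physical gap between characters on the page. This sketch instead assumes the gap is already represented as a run of spaces, and the `min_gap` threshold is a hypothetical parameter.

```python
import re

def mark_tabs(line, min_gap=3):
    """Replace any run of at least min_gap spaces with a single tab character."""
    return re.sub(r" {%d,}" % min_gap, "\t", line)

marked = mark_tabs("Item      Qty   Price")
columns = marked.split("\t")
```

Once the tabs are in place, a pattern can use `\t` to tell a column gap apart from an ordinary space between words.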
Tables are one of the most common ways data is organized on documents. Human beings have been writing information into tables before they started writing literature, even before paper was invented. There are examples of tables carved onto the walls of Egyptian temples! They are excellent structures for representing a lot of information with various characteristics in common in a relatively small space (or an Egyptian temple sized space). However, targeting the data inside them presents its own set of challenges. A table’s structure can range from simple and straightforward to more complex (even confounding). Different organizations may organize the same data differently, creating different tables for what, essentially, is the same data.
Test Batches are only visible to Grooper Design Studio users and are used to test anything configured in Grooper, such as:
- Document Classification
- OCR Profile configuration
- IP Profile configuration
- Data Extraction
- Whole or portions of Batch Processes (or Activity configurations more generally)
They are not meant for full fledged, "real world" document processing. These Batches are not visible to "Production" level users, such as Grooper Dashboard workstations.
In computer science, a thread is the smallest unit of processing that can be performed within an operating system.
One thread can perform one "task" in Grooper. This allows for concurrent processing of multiple tasks. For example, if your system has 8 threads available, and the Recognize activity is set to the Page level Scope, the activity will run on 8 pages at a time. If your system only has 2 threads available, only 2 pages will be processed. If you have 64 threads, 64 pages will be processed at a time.
You can control how many threads are used to process an Activity using the Thread Pool object and installing the Activity Processing service in Grooper Config.
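The relationship between thread count and pages processed at a time can be pictured with Python's standard `ThreadPoolExecutor`. This is an analogy only, not how Grooper's Activity Processing service is implemented, and `recognize` is a placeholder for per-page work.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize(page):
    # Placeholder for per-page work such as OCR.
    return f"text of {page}"

pages = [f"page {i}" for i in range(1, 17)]

# With 8 worker threads, at most 8 pages are processed at the same time.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(recognize, pages))
```

Lowering `max_workers` to 2 would process the same 16 pages, two at a time.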
Thread Pools allow Grooper to allocate system resources to different Unattended Activities.
New Grooper Repositories initialize with a single Thread Pool named "Default". In many cases, this is sufficient to run all activities. However, one can alter how many threads are allocated to an activity in a Batch Process. This is done by creating a new Thread Pool, assigning it to an Activity Processing Grooper Service from Grooper Config, specifying how many threads that Thread Pool should use, and referencing the Thread Pool on that activity's Batch Process Step properties.
A training based approach to document classification classifies documents according to the similarity of unclassified documents to trained examples of that kind of document (or Document Type from Grooper's perspective).
There are two training based approaches in Grooper.
- This classification method trains text features (words and phrases) of example documents. Document samples are trained as examples of a Document Type of a Content Model. Training occurs via user supervised machine learning using the TF-IDF algorithm. A Data Extractor set on the Text Feature Extractor property returns words, phrases or other results to provide possible identifiers used to classify a document. These identifiers (the words, phrases or other results from the Data Extractor used) are called "features." Document training uses TF-IDF to assign weightings to those features. During classification, Grooper looks at the weightings list for the various trained Document Types and compares them to the text features on the current document to be classified. The document is then assigned a percentage similarity score to each possible Document Type match. Whichever Document Type has the highest percentage similarity is used to classify the document.
- Note: This is the most common method. It is so common that "training based approach" and "Lexical classification" are often used interchangeably.
- The Visual classification method uses image data instead of text data to determine the Document Type. Instead of using text extractors, an IP Profile will be set with an Extract Features command to get data pertaining to a document's image. Document samples are trained as examples of a Document Type.
- Note: While this is a much less commonly used method, it is still technically a training based approach to classification.
UNC (Universal Naming Convention) is a standard used in Windows for accessing shared network folders.
These names consist of a host machine name, a share name, and an optional file path. Drive letters (C:, D:, etc.) are not used in UNC path names. UNC paths must be fully qualified, meaning the entire path location must be typed out, starting with the top most directory in the hierarchy (such as the server's name). This disambiguates file and folder locations on one networked machine from another. The UNC convention for Windows paths is as follows:
It is always best practice to use UNC paths. Mapped and local drive references may not be accessible to other users or machines.
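The parts of a UNC path can be inspected with Python's standard `pathlib` module. The server, share, and file names below are hypothetical.

```python
from pathlib import PureWindowsPath

# A UNC path: \\host\share followed by an optional file path.
path = PureWindowsPath(r"\\fileserver01\grooper\filestore\batch1.pdf")

host_and_share = path.drive     # the \\host\share portion
relative_parts = path.parts[1:] # the folders and file below the share
```

No drive letter appears anywhere in the path. For UNC paths, `pathlib` reports the host and share together in place of a drive.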
The Visual classification method uses image data instead of text data to determine the Document Type. Instead of using text extractors, an IP Profile is used with an Extract Features command to obtain data pertaining to a document's image. Document samples are trained as examples of a Document Type.
For example, a common feature used is "intensity". The document is divided into cells and the percentage of black to white pixels is measured. During classification, Grooper looks at the values obtained by the IP Profile and compares them to those on the document to be classified. The document is then given a percentage similarity score to each Document Type. Whichever Document Type has the highest percentage similarity is assigned to the document. In the case of the "intensity" example, each cell's intensity is compared with the training example to determine similarity via the black to white pixels ratio.
Think of a structured form, where the lines and text change very little. Therefore, if the document is divided into cells, the percentage of black pixels in that cell will be very similar from document to document.
Visual classification is unique in that it does not require OCR. It can be performed real time during scanning.
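The "intensity" feature can be sketched on a tiny black-and-white image. The functions `cell_intensities` and `similarity` are hypothetical, and the similarity measure (one minus the mean absolute difference between cells) is an assumption chosen for illustration, since the exact measure is not described here.

```python
def cell_intensities(image, rows, cols):
    """image: 2D list of 0 (white) / 1 (black) pixels.

    Returns the fraction of black pixels in each cell of a rows x cols grid.
    """
    height, width = len(image), len(image[0])
    features = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * height // rows, (r + 1) * height // rows
            x0, x1 = c * width // cols, (c + 1) * width // cols
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell))
    return features

def similarity(a, b):
    """1.0 for identical feature lists, lower as the cells differ."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)

trained = [[1, 1, 0, 0],
           [1, 1, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]
candidate = [[1, 1, 0, 0],
             [1, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
score = similarity(cell_intensities(trained, 2, 2),
                   cell_intensities(candidate, 2, 2))
```

The candidate differs from the trained sample by a single pixel in one cell, so the score stays close to 1.0.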