2.80:Layered OCR (OCR Engine)

From Grooper Wiki

Layered OCR is an OCR engine that enables you to run multiple OCR profiles in a "layered" manner.

About

This engine is designed to read documents with specialized print types (such as MICR or handwriting) and to get as close to 100% accuracy as possible for background elements on forms.

How it works

  1. The "Main OCR Profile" establishes the base OCR output.
  2. Each subsequent layer is configured to run an additional OCR Profile and an "Extractor". This is key: because each layer runs independently of the "Main OCR Profile", specialized OCR Profiles can be created and used to ensure the desired data is extracted properly.
  3. The extraction results of each subsequent layer are then merged into the final OCR output.
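The three steps above can be sketched in a few lines of Python. This is only a toy model, not Grooper's implementation: the `layered_ocr` function, the line-number-to-text dictionaries, and the assumption that character positions line up between OCR passes are all hypothetical, and a plain regular expression stands in for the layer's Extractor.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical model: an OCR result maps a line number to the text read on that line.
OcrOutput = Dict[int, str]


def layered_ocr(main_output: OcrOutput,
                layers: List[Tuple[OcrOutput, "re.Pattern[str]"]]) -> OcrOutput:
    """Merge extractor matches from each layer into a copy of the main output.

    Each layer is a pair: (output of that layer's OCR profile, extractor pattern).
    Assumes, for illustration only, that character offsets are comparable
    between the main output and the layer output.
    """
    merged = dict(main_output)  # step 1: the main profile sets the base output
    for layer_output, extractor in layers:
        for line_no, text in layer_output.items():
            # step 2: run the extractor against this layer's own OCR result
            for match in extractor.finditer(text):
                start, end = match.span()
                base = merged.get(line_no, "").ljust(end)
                # step 3: overwrite the same span in the base line with the match
                merged[line_no] = base[:start] + match.group() + base[end:]
    return merged


# Usage: the main profile misreads a MICR-style line; a specialized layer reads it well.
main = {0: "Pay to: John Smith", 1: "A02lOOOO2lA l2345"}
micr = {0: "Pay t0: J0hn Sm1th", 1: "A021000021A 12345"}
merged = layered_ocr(main, [(micr, re.compile(r"A\d{9}A \d+"))])
```

Only the text matched by the layer's extractor is merged. The specialized profile's poor reading of line 0 never reaches the final output, which is why each layer pairs an OCR Profile with an Extractor.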

Use Cases

Mixed Print Types

Layered OCR is most useful when you need to extract text from documents that contain multiple print types.

Label Repair

You can also use Layered OCR to "repair" lines of text, eliminating inaccuracies in field labels and greatly simplifying data extraction.

Properties

General
Main OCR Profile: The OCR Profile used to establish the base OCR output.
Layers: Defines one or more OCR Layers which supplement or repair output from the Main OCR Profile.

Layers

Each layer is configured independently and has the following properties.

General
OCR Profile: The OCR Profile to execute.

If different from the Main OCR Profile, a new OCR operation will be performed using this profile. The Extractor will be executed against the results of this OCR operation, and any matches will be merged into the main OCR output.

If the OCR Profile specified is the Main OCR Profile, then no new OCR is performed. The Extractor will be executed against the main OCR output, and any matches produced will be merged into the main OCR output. For example, in this mode one could fuzzy match full lines of field labels on a structured form, and repair the OCR results with the corrected output from the fuzzy match operation.
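The label-repair mode described above can be illustrated with a small sketch. It is not Grooper's FuzzyRegEx: Python's standard `difflib.SequenceMatcher` stands in for the fuzzy matcher, and the `repair_labels` function, the sample labels, and the 0.75 similarity threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import List


def repair_labels(ocr_lines: List[str], known_labels: List[str],
                  threshold: float = 0.75) -> List[str]:
    """Replace each OCR line with the closest known label when it is similar enough.

    Lines that do not resemble any known label are returned unchanged.
    """
    repaired = []
    for line in ocr_lines:
        best_label, best_score = None, 0.0
        for label in known_labels:
            # Similarity ratio in [0, 1]; case is ignored for the comparison.
            score = SequenceMatcher(None, line.lower(), label.lower()).ratio()
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= threshold:
            repaired.append(best_label)  # corrected output replaces the OCR text
        else:
            repaired.append(line)
    return repaired


# Usage: two misread labels are repaired; the unrelated line is left alone.
lines = ["Patlent Narne:", "Date of B1rth:", "Some unrelated text"]
labels = ["Patient Name:", "Date of Birth:"]
fixed = repair_labels(lines, labels)
```

Once the labels in the OCR output are exact, downstream extractors can anchor on them with simple, non-fuzzy patterns.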

Extractor: The extractor used here MUST meet the following requirements, or it will not produce output instances that can be used for OCR repair. Check the "Log" tab when testing OCR to see error messages for invalid matches.
  • Each match produced by the extractor must reference a contiguous sequence of characters occurring within a single line of text.
  • FuzzyRegEx and FuzzyList are the only supported match modes.
  • To prevent unintentional matches across lines, ensure that the Fuzzy Match Weightings list line break characters as immutable (i.e. "Immutable=\r\n").
  • Do not use fuzzy lookups, output formats, or lexicon translation to modify output values.
  • Do not use any Collation Provider which produces non-contiguous output values when configuring a Data Type.
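The first requirement in the list above (a contiguous run of characters within a single line) can be expressed as a simple check. The sketch below is illustrative only: `is_usable_for_repair` is a hypothetical helper, and representing a match as a list of character indices into the OCR text is an assumption, not Grooper's data model.

```python
from typing import List


def is_usable_for_repair(ocr_text: str, char_indices: List[int]) -> bool:
    """Return True if the match is a contiguous run of characters on one line."""
    if not char_indices:
        return False
    # Every index must immediately follow the previous one (no gaps, no reordering).
    if any(b - a != 1 for a, b in zip(char_indices, char_indices[1:])):
        return False
    first, last = char_indices[0], char_indices[-1]
    if first < 0 or last >= len(ocr_text):
        return False
    # The matched span must not cross a line break.
    span = ocr_text[first:last + 1]
    return "\r" not in span and "\n" not in span


# Usage with a two-line OCR text.
text = "Name: John\r\nDate: 2020"
ok = is_usable_for_repair(text, [0, 1, 2, 3])              # "Name" on one line
gapped = is_usable_for_repair(text, [0, 1, 3])             # skips a character
crosses = is_usable_for_repair(text, list(range(8, 14)))   # spans the line break
```

A match that fails a check like this corresponds to the kind of invalid match reported on the "Log" tab.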