Azure OCR (OCR Engine): Difference between revisions

From Grooper Wiki

Revision as of 14:50, 4 October 2024

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

Azure OCR is an OCR Engine option for OCR Profiles that utilizes Microsoft Azure's Read API. Azure's Read engine is AI-based text recognition software that uses a convolutional neural network (CNN) to recognize text. Compared to traditional OCR engines, it yields superior results, especially for handwritten text and poor-quality images. Furthermore, Grooper supplements Azure's results with those from a traditional OCR engine in areas where traditional OCR outperforms the Read engine.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2024). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Azure OCR is different from traditional OCR Engines. It is a CNN-based (Convolutional Neural Network) OCR Engine, meaning that it is AI-based. Because of the way this neural network has been trained, Azure OCR is less dependent on Image Processing.

Unlike traditional OCR, Azure OCR is far more accurate when recognizing handwritten text on documents. However, Azure OCR alone does not give 100% accurate position data for characters. It only gives an approximation. This can cause problems for extractors that rely on character/text positions, such as Labeled Value, Labeled OMR, or Tabular Layout. Azure OCR also does not always capture small numeric values such as single 1s and 0s. This can make collecting some data problematic.

To compensate, a traditional OCR Engine (Transym) runs at the same time when using Azure OCR, because traditional OCR is highly effective at obtaining position data and can capture smaller values. A traditional OCR Engine is more dependent on Image Processing. When Azure OCR is chosen, a default set of Image Processing steps is applied to the document to improve traditional OCR accuracy.

Grooper attempts to return the most accurate results from both the Azure OCR and the traditional OCR Engine.

Traditional OCR vs. Azure OCR

In the screenshots below, we can see the difference between using traditional OCR and Azure OCR on a document that has small text and handwriting.

  1. In the first screenshot, we can see the result of using traditional OCR. Traditional OCR is not equipped to handle handwriting, and the small print with minimal spacing between characters makes recognition very difficult.


  1. In this second screenshot, we have used Azure OCR on the same document. Azure OCR relies on its trained CNN rather than individual character analysis. It does a much better job of returning accurate data for this particular document, even in the handwritten sections.


Azure OCR drawbacks

Azure OCR by itself is not perfect. There are things that traditional OCR captures better than Azure OCR, so we don't rely solely on Azure OCR for recognizing text. Instead, when you select Azure OCR as your OCR Engine, both Azure OCR and a default traditional OCR Engine will run, and Grooper will combine the results.

Two things that Azure OCR is less adept at returning are small numbers, such as single 0s and 1s, and accurate character/text segment positions. As mentioned above, inaccurate positions can be especially problematic when using extractors that rely heavily on positioning data.
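
To illustrate why approximate positions matter, here is a minimal Python sketch of a label-anchored lookup. It is not Grooper's Labeled Value implementation. The `Word` class, the `value_right_of` function, and the coordinates are all hypothetical, chosen only to show how a box that has drifted off the label's line causes a position-based lookup to miss a value that was otherwise read correctly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """A recognized word with a bounding box (hypothetical structure)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def value_right_of(label: Word, words: List[Word]) -> Optional[Word]:
    """Return the nearest word to the right of `label` on the same text line."""
    candidates = [
        w for w in words
        if w is not label
        and w.left >= label.right                          # starts right of the label
        and w.top < label.bottom and w.bottom > label.top  # overlaps it vertically
    ]
    return min(candidates, key=lambda w: w.left - label.right, default=None)


label = Word("Total:", 100, 200, 160, 220)

# Accurate position: the amount sits on the label's line.
accurate = Word("41.50", 170, 201, 230, 219)
# Approximate position: same text, but the box has drifted below the line.
drifted = Word("41.50", 170, 224, 230, 242)

found_accurate = value_right_of(label, [label, accurate])
found_drifted = value_right_of(label, [label, drifted])

print(found_accurate.text if found_accurate else None)
print(found_drifted.text if found_drifted else None)
```

In this sketch the text "41.50" is identical in both cases. Only the box differs, and that alone decides whether the lookup succeeds.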

  1. In the screenshot below, we are looking at the Diagnostics page after running Recognize configured with Azure OCR on a document.
  2. In the Diagnostics, the "Azure Words.tif" will show what Azure OCR by itself returned.
  3. In this case, there are two 0s that are not being captured at all by Azure OCR. They are small numbers that have been skipped.
  4. We also see that Azure OCR found all numeric values in the PAID AMT column in the table, but the positioning data is not accurate.


  1. If we select the "Alignment.tif" in the Diagnostics tree on the left, we can see the combined result of Azure OCR and the traditional OCR Engine.
  2. The characters and text segments on the document highlighted in orange are corrections made from the results of traditional OCR. The traditional OCR Engine detected the 0s that Azure OCR missed.
  3. Grooper also determined that the traditional OCR Engine did a better job at recognizing one of the numeric values whose position data was not accurately detected by Azure OCR.


The result of combining both OCR Engines is what Grooper will actually recognize from the document.
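
The gap-filling part of this idea can be sketched in Python as follows. This is an illustration only and does not reproduce Grooper's actual merging logic, which is not described in this article. The `Word` class, the `overlap_ratio` and `merge` functions, the 0.5 threshold, and the sample coordinates are all assumptions. The sketch keeps every AI-engine word and adds a traditional-engine word only where no AI word covers the same area, which is how a skipped "0" would be recovered.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass
class Word:
    """A recognized word and its bounding box (hypothetical structure)."""
    text: str
    box: Box


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (width * height) / min(area_a, area_b)


def merge(ai_words: List[Word], traditional_words: List[Word],
          threshold: float = 0.5) -> List[Word]:
    """Keep all AI words; add traditional words that no AI word covers."""
    merged = list(ai_words)
    for t in traditional_words:
        covered = any(overlap_ratio(t.box, a.box) >= threshold for a in ai_words)
        if not covered:
            merged.append(t)  # text the AI engine skipped
    # Rough reading order: top to bottom, then left to right.
    return sorted(merged, key=lambda w: (w.box[1], w.box[0]))


ai = [Word("PAID", (10, 10, 50, 20)), Word("125.00", (60, 10, 110, 20))]
traditional = [Word("PAlD", (10, 10, 50, 20)), Word("0", (120, 10, 126, 20))]

merged_text = [w.text for w in merge(ai, traditional)]
print(merged_text)
```

A fuller version would also need a rule for which engine's position to trust when both engines find the same word. That choice is left out here because the article does not specify how Grooper makes it.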

How to

To use Azure OCR, you will need to add and configure an OCR Profile, then add that OCR Profile to your Recognize Batch Process Step. Once that is complete, you can test your Step or run the Batch Process.

Setting up the OCR Profile

Adding an OCR Profile

  1. Right-click on the Project or folder inside of your Project in your Node Tree where you want to add your OCR Profile.
  2. Hover over "Add".
  3. Click on "OCR Profile..."


  1. Enter your desired name for your OCR Profile in the Name property field.
  2. Click "EXECUTE" in the top right-hand corner of the pop-up window to create your OCR Profile.


  1. Now you should have a new OCR Profile in your Node Tree.


Configuring the OCR Profile

  1. Click the hamburger icon to the right of the OCR Engine property to access the drop down menu.
  2. Select Azure OCR from the drop down menu.


  1. Copy and paste your unique API Key into the API Key property. Then select your API Region from the drop-down menu, which you can open by clicking the hamburger icon next to the property.
  2. Optionally, you can add a Traditional Ocr Profile. If this property is left blank, Grooper will run a basic Traditional OCR Engine (Transym) in addition to Azure OCR. If you would like to override the default, you can select a different OCR Profile here.
  3. Click the save icon in the top right of the property grid to save your changes.
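
Grooper sends requests to Azure for you, so no code is needed for this step. For readers curious what the API Key and API Region correspond to, the Python sketch below builds (but does not send) a request shaped like a call to the Read operation of the Azure Computer Vision v3.2 REST API. The regional endpoint format, the v3.2 path, and the `build_read_request` helper are assumptions made for illustration. Grooper's internal requests may differ, and some Azure resources use a custom endpoint name instead of a region.

```python
import urllib.request


def build_read_request(api_key: str, region: str,
                       image_bytes: bytes) -> urllib.request.Request:
    """Build, without sending, a Read 'analyze' request (assumed v3.2 shape)."""
    # Assumption: a regional endpoint of the form <region>.api.cognitive.microsoft.com
    url = f"https://{region}.api.cognitive.microsoft.com/vision/v3.2/read/analyze"
    headers = {
        "Ocp-Apim-Subscription-Key": api_key,        # the API Key
        "Content-Type": "application/octet-stream",  # raw image bytes in the body
    }
    return urllib.request.Request(url, data=image_bytes,
                                  headers=headers, method="POST")


# Placeholder values; substitute your own key, region, and image data.
request = build_read_request("YOUR-API-KEY", "eastus", b"image bytes go here")
print(request.get_method(), request.full_url)
```

The Read operation is asynchronous: a successful submission returns an `Operation-Location` header, and the recognized text is retrieved by polling that URL. The sketch stops before that step.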

Adding the OCR Profile to the Recognize Step

You will need to add a Batch Process Step configured with the Recognize Activity to your Batch Process. You will also need to configure the Step Properties such as the Activity and Scope. For help with setting up your Batch Process, take a look at our Batch Process article.

  1. Add and select the Recognize Step in your Batch Process in the Node Tree.
  2. Click on the hamburger icon to the right of the OCR Profile property to access the navigation drop down.
  3. Navigate to and select the OCR Profile that has been configured with Azure OCR as its OCR Engine.


  1. Finish configuring your Batch Process Step and then click the save icon located in the top right of the Step Properties property grid to save your changes.

Glossary

Activity: Grooper Activities define specific document processing operations done to a Batch, Batch Folder, or Batch Page. In a Batch Process, each Batch Process Step executes a single Activity (determined by the step's "Activity" property).

  • Batch Process Steps are frequently referred to by the name of their configured Activity followed by the word "step". For example: "Classify step".

Alignment: "Alignment" refers to how Grooper highlights text from an AI response on a document in a Document Viewer. Alignment properties can be configured to alter how Grooper highlights results when using LLM-based extraction methods, such as AI Extract.

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Batch Process Step: Batch Process Steps are specific actions within a Batch Process sequence. Each Batch Process Step performs an "Activity" specific to some document processing task. Each Activity is either a "Code Activity" or a "Review" activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Process Steps are frequently referred to as simply "steps".
  • Because a single Batch Process Step executes a single Activity configuration, they are often referred to by their referenced Activity as well. For example, a "Recognize step".

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a Batch. Each step consists of a "Code Activity" or a "Review" activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Image Processing: "Image processing", as a general term, refers to software techniques that manipulate and enhance images. Image processing removes imperfections and adjusts images to improve OCR accuracy. In Grooper, images are processed primarily by two Activities:

  • Image Processing - This Activity permanently adjusts the image. It is primarily used to compensate for defects produced by a document scanner (like border artifacts and skewed images). It does so by applying IP Commands in an IP Profile.
  • Recognize - This Activity performs OCR. When an OCR Profile references an IP Profile, the image will be processed temporarily. A temporary image is handed to the OCR engine and discarded once characters are recognized.
  • Grooper also has "computer vision" capabilities that analyze and interpret images. These capabilities are also executed during Grooper's image processing. For example, Grooper's "Line Removal" command will locate lines on an image (computer vision), remove those artifacts to improve OCR results during Recognize (image processing) and store that data for later use in Grooper (computer vision).

Image Processing: Image Processing is an Activity that enhances Batch Page images and optimizes them for better OCR text recognition and data extraction results.

IP Profile: IP Profiles are a step-by-step list of image processing operations (IP Commands). They are used for several image processing related operations, but primarily for:

  1. Permanently enhancing an image during the Image Processing activity (usually to get rid of defects in a scanned image, such as skewing or borders).
  2. Cleaning up an image in-memory during the Recognize activity without altering the image to improve OCR accuracy.
  3. Computer vision operations that collect layout data (table line locations, OMR checkboxes, barcode value and more) utilized in data extraction.

Labeled OMR: Labeled OMR is a Value Extractor used to output OMR checkbox labels. It determines whether labeled checkboxes are checked or not. If checked, it outputs the label(s) or a Boolean true/false value as the result.

Labeled Value: Labeled Value is a Value Extractor that identifies and extracts a value next to a label. This is one of the most commonly used extractors to extract data from structured documents (such as a standardized form) and static values on semi-structured documents (such as the header details on an invoice).

Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized so that it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

OCR Engine: An "OCR engine" is the part of OCR software that recognizes text from images. OCR engines analyze the image's pixels to determine where text is on the page and what each character is. In Grooper, OCR engines are selected when configuring an OCR Profile's OCR Engine property.

OCR Profile: OCR Profiles store configuration settings for optical character recognition (OCR). They are used by the Recognize activity to convert images of text on Batch Pages into machine-encoded text. OCR Profiles are highly configurable, allowing fine-grained control over how OCR occurs, how pre-OCR image cleanup occurs, and how Grooper's OCR Synthesis occurs. All this works to the end goal of highly accurate OCR text data, which is used to classify documents, extract data and more.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, and profile objects are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Recognize: Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. When properly configured with an OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.

Scope: The Scope property of a Batch Process Step, as it relates to an Activity, determines at which level in a Batch hierarchy the Activity runs.

Tabular Layout: The Tabular Layout Table Extract Method uses column header values determined by the Data Columns Header Extractor results (or labels collected for the Data Columns when a Labeling Behavior is enabled) as well as Data Column Value Extractor results to model a table's structure and return its values.