OCR (Concept)

OCR Synthesis

Synthesis is Grooper's unique approach to getting better results from an OCR Engine. Using Synthesis, portions of the document can be OCR'd independently from the full text OCR, portions of the image dropped out from the first OCR pass can be re-run, and certain results can be reprocessed. The results from the Synthesis operation then get combined with the full text OCR results from the OCR Engine into a single text flow.

Synthesis is a collection of five separate OCR processing operations:

Bound Region Processing
Iterative Processing
Cell Validation
Segment Reprocessing
Font Pitch Detection

As separate operations, the user can choose to enable all five operations, only one, or any combination. Synthesis is enabled on OCR Profiles, using the "Synthesis" property. This property is enabled by default on OCR Profiles (and can be disabled if you so choose). However, each Synthesis operation needs to be configured independently in order to function.

The general idea behind each of these operations is to increase the accuracy of OCR results by narrowing the OCR Engine's "field of vision". In general, the less the OCR Engine has to look at, the better the results will be. Rather than expecting the OCR Engine to get highly specific character accuracy by looking at the whole image, each operation breaks up the image in some way, allowing the OCR Engine to focus on only a portion of it. The accuracy for that portion is then increased and the results are "synthesized" into a final, more accurate, result.

Bound Region Processing

Bound Region Processing performs OCR independently on text inside regions fully enclosed within lines. In other words, it processes text inside a box separately from the full page OCR. This vastly improves the OCR results for text inside tables or a complex line structure. By limiting OCR to just what is inside the box, the rest of the content on the page is not competing for the OCR Engine's attention, ultimately improving the result.

It does change how OCR runs quite a bit. Bound Region Processing actually runs before full page OCR. The order of operations is as follows:

1) Bound Region Detection - First, boxes are identified on the page.

Box size can be configured using Bound Region Processing's properties. There are also options to merge boxes of the same height and to ignore boxes that span across the entire width of the page. Since each box is OCR'd independently, this can reduce the number of total OCR operations, which will reduce the time it takes for Bound Region Processing to run.
Bound Region Detection works from the original image, not an IP image (if created using the OCR Profile's "IP Profile" property). So, it will ignore any Line Removal command applied during the temporary image pre-processing.

2) Bound Region OCR - After bound regions are identified, text within each bound region is OCR'd.

Each region is OCR'd independently. If there are ten boxes, there will be ten OCR operations, one for each box.

3) Bound Region Dropout - Since the contents of these regions have been OCR'd, these pixels are removed from the image used for full page OCR. Grooper already has text for these regions, so they do not need to be OCR'd again.

Bound Region Processing is a one-two punch of OCR accuracy. Not only does it improve the accuracy of text inside bound regions, it can also increase the accuracy of text outside bound regions. Just as the rest of the image can interfere with the accuracy of OCR'd text inside the boxes, the boxes and the text inside them can interfere with OCR'ing the other text on the page. Dropping out the bound regions can give a bonus accuracy boost to the rest of the document.

4) Full Page OCR - The OCR Engine then runs on the resulting image, grabbing the rest of the text from the image.

5) Synthesis - Finally, the two results (the results inside bound regions + the results outside the bound regions) are merged together into a single text flow.
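The five steps above can be sketched in a few lines of Python. This is only an illustration of the order of operations, not Grooper's implementation: the page is modeled as a grid of characters, the boxes are passed in rather than detected, and `Box`, `stub_ocr`, `drop_out`, and `bound_region_ocr` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def stub_ocr(page, box):
    """Hypothetical stand-in for an OCR engine.

    Returns (row, col, char) for every non-blank cell inside `box`.
    """
    hits = []
    for r in range(box.top, box.bottom):
        for c in range(box.left, box.right):
            if page[r][c] != " ":
                hits.append((r, c, page[r][c]))
    return hits


def drop_out(page, boxes):
    """Blank out every cell covered by a bound region (step 3)."""
    rows = [list(row) for row in page]
    for b in boxes:
        for r in range(b.top, b.bottom):
            for c in range(b.left, b.right):
                rows[r][c] = " "
    return ["".join(row) for row in rows]


def bound_region_ocr(page, boxes):
    # Step 2: one independent OCR operation per detected box.
    region_hits = []
    for b in boxes:
        region_hits.extend(stub_ocr(page, b))
    # Step 3: remove the already-read regions from the full page input.
    main_input = drop_out(page, boxes)
    # Step 4: full page OCR on what is left.
    full_page = Box(0, 0, len(page[0]), len(page))
    page_hits = stub_ocr(main_input, full_page)
    # Step 5: merge both result sets into one flow, ordered by position.
    return main_input, sorted(region_hits + page_hits)


page = ["TOTAL DUE ",
        "  AB  CD  "]
boxes = [Box(2, 1, 4, 2), Box(6, 1, 8, 2)]  # step 1 is assumed already done
main_input, merged = bound_region_ocr(page, boxes)
```

In this toy run, `main_input` plays the role of the image handed to the OCR Engine for the full page pass: the second row is blank because both boxes were dropped out, while `merged` still contains every character on the page exactly once.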

Configuring Bound Region Processing

Enable Bound Region Processing / Verify It's Working / Configure Properties (If Necessary)

With Synthesis enabled, Bound Region Processing is a configurable property on OCR Profiles.

Selecting an OCR Profile, navigate to the "Bound Region Processing" property and change it from "Disabled" to "Enabled"

For the majority of cases, Bound Region Processing will successfully detect bound regions using the default properties. You can verify this using the "OCR Testing Tab".

Select a page from the Test Batch and press the "OCR Page" button. This will perform OCR on only the selected page to test out the OCR profile.

After you press the "OCR Page" button, a new tab will appear underneath the image, the "Diagnostics" tab. This tab has several images related to the OCR operation. If Bound Region Processing was successful, you will see a "Bounded Regions" image. Select that image. All bounded regions will be highlighted in green and outlined in blue.

Also, FYI, the "Main OCR Input" image shows the original image with the text OCR'd from bound region processing dropped out. This is what will be handed to the OCR Engine for full page OCR. (Furthermore, if we ran a temporary IP Profile on this OCR Profile, we could easily get rid of those table lines as well, further increasing the efficacy of the OCR operation.)

Bound Region Processing has several properties you can configure if necessary. You can reveal its properties by double clicking "Bound Region Processing" in the OCR Profile, or pressing the caret button to the left of the property.

Properties affecting box size and detection

Property	Default Value	Information
Minimum Size	6pt	This setting controls the minimum width or height of a box. With the default value, the smallest box that can be detected is 6 pt wide by 1 pt high, or 1 pt wide by 6 pt high. That is a fairly small box. If Bound Region Processing is detecting bound regions that aren't boxes, you may find it useful to increase this value.
Minimum Area	12pt	This setting controls the minimum area of a box. This works in combination with the "Minimum Size" property to control which boxes are detected. So, even though the Minimum Size default is 6 pt, Bound Region Processing won't actually detect a 6 pt wide by 1 pt high box, because its area is only 6 pt (6 x 1 = 6 and 6 < 12). Like Minimum Size, this property is helpful for narrowing down which bound regions should be included in the Bound Region Processing operation.
Maximum Width Ratio	75%	This property controls the maximum width of a single box relative to the width of the whole page. At 75%, a single box will not be detected if it is wider than three quarters of the width of the whole page. If you want to detect boxes of any width, even if they span the full width of the page, set this property to 100%.
Maximum Height	1in	Here, you can limit the maximum height of a box. If you wish to detect boxes of any height, change this property to "0". This property also interacts with the "Always Allow Landscape" property. See below for important information on how they interact.
Always Allow Landscape	True	By default, boxes that are longer than they are high (having a "landscape" instead of "portrait" orientation) are exempted from exclusion even if they are higher than the "Maximum Height" value. Only boxes that are narrower than they are high ("portrait" oriented boxes) will be excluded from Bound Region Processing. If you are attempting to remove boxes that are longer than they are high from processing using the "Maximum Height" value, set this property to "False".
Maximum Count	0	With this property set to "0" there will be no limit to the number of boxes detected. If you do enter a maximum count value, bound region detection will stop once it finds one less than the maximum value. (For example, if you enter a Maximum Count of "10" and there are 11 boxes on the page, only 9 bound regions will be detected.)
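To make the interaction between these properties concrete, the following sketch applies the rules as they are described in the table to a single candidate box. It is one reading of the property descriptions, not Grooper's actual detection code. `DetectionSettings` and `keeps_box` are hypothetical names, all measurements are assumed to be in points (so the 1 in default becomes 72), and "Maximum Count" is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DetectionSettings:
    """Defaults copied from the table above; names are hypothetical."""
    minimum_size: float = 6.0          # pt, applies to the longer side
    minimum_area: float = 12.0         # width x height
    maximum_width_ratio: float = 0.75  # fraction of the page width
    maximum_height: float = 72.0       # pt (1 in); 0 means "no limit"
    always_allow_landscape: bool = True


def keeps_box(width: float, height: float, page_width: float,
              settings: Optional[DetectionSettings] = None) -> bool:
    """Return True if a box would survive the filters described above."""
    s = settings or DetectionSettings()
    # Minimum Size: width OR height must reach the minimum.
    if max(width, height) < s.minimum_size:
        return False
    # Minimum Area: works together with Minimum Size.
    if width * height < s.minimum_area:
        return False
    # Maximum Width Ratio: reject boxes that are too wide for the page.
    if width / page_width > s.maximum_width_ratio:
        return False
    # Maximum Height, with the landscape exemption.
    too_tall = s.maximum_height > 0 and height > s.maximum_height
    if too_tall:
        landscape = width > height
        if not (s.always_allow_landscape and landscape):
            return False
    return True


PAGE_WIDTH = 612.0  # assumed 8.5 in page, in points

print(keeps_box(6, 1, PAGE_WIDTH))      # area 6 < 12
print(keeps_box(200, 100, PAGE_WIDTH))  # tall, but landscape
print(keeps_box(50, 100, PAGE_WIDTH))   # tall and portrait
```

Under these assumptions the 6 pt by 1 pt box is rejected by the area check, the tall landscape box is kept because of "Always Allow Landscape", and the tall portrait box is excluded.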

The Merge Regions property

The Merge Regions property does not have to do with how regions are detected, but instead how those regions are processed. When enabled, it will merge adjacent boxes next to each other on a horizontal line as long as they are the same height. Furthermore, they must themselves meet a height requirement in order to be merged, set by the "Maximum Merge Height" property.

This can speed up Bound Region Processing by lowering the number of total OCR operations. However, it does have the potential to negatively impact the accuracy of the results in each cell. Whether or not you choose to use this property will mostly depend on whether you value the speed of the OCR operation over its accuracy. This property is enabled by default. You will need to disable it in order to see if it impacts the accuracy of Bound Region Processing.

Property	Default Value	Information
Merge Regions	True	This setting controls whether or not adjacent boxes of the same size are merged together.
Maximum Merge Height	14pt	Adjacent boxes of the same height smaller than this value will be merged together. If you wish to ignore a maximum merge height, merging all boxes of the same height on the same line regardless of size, enter "0" here.
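The merge rule can be illustrated with a short sketch. It assumes that "adjacent" means two boxes share a vertical edge and sit on the same horizontal line (same top and bottom). Grooper's real adjacency test is not documented here, so treat `Box` and `merge_regions` as hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float

    @property
    def height(self) -> float:
        return self.bottom - self.top


def merge_regions(boxes, maximum_merge_height=14.0):
    """Merge same-height boxes that touch on a horizontal line.

    maximum_merge_height == 0 means "merge regardless of height".
    """
    merged = []
    for box in sorted(boxes, key=lambda b: (b.top, b.left)):
        prev = merged[-1] if merged else None
        can_merge = (
            prev is not None
            and prev.top == box.top
            and prev.bottom == box.bottom      # same height, same line
            and prev.right == box.left         # assumed: shared edge
            and (maximum_merge_height == 0
                 or box.height < maximum_merge_height)
        )
        if can_merge:
            merged[-1] = Box(prev.left, prev.top, box.right, prev.bottom)
        else:
            merged.append(box)
    return merged


short_row = [Box(0, 0, 50, 12), Box(50, 0, 100, 12), Box(100, 0, 150, 12)]
tall_row = [Box(0, 12, 50, 32), Box(50, 12, 100, 32)]

default_result = merge_regions(short_row + tall_row)
no_limit_result = merge_regions(short_row + tall_row, maximum_merge_height=0)
```

With the default of 14 pt, the three 12 pt cells collapse into one region (one OCR operation instead of three) while the 20 pt cells stay separate. Setting the height to 0 merges the tall row as well.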

Iterative Processing

Iterative Processing improves the OCR operation by performing a second pass at OCR. After the OCR Engine performs full page OCR, characters recognized from the first pass are digitally dropped out. Then, a second OCR pass is run on the resulting image. This way characters that were ignored from the first pass can be isolated and recognized separately. And the results are merged with the OCR results from the first pass.

First, OCR runs on the full page. Recognized characters are digitally removed. A second OCR pass runs on the remaining portion of the image.

Configuring Iterative Processing

Enable Iterative Processing / Verify It's Working

With Synthesis enabled, "OCR Iterations" is a configurable property on OCR Profiles.

Selecting an OCR Profile, navigate to the "OCR Iterations" property and change it from "1" to "2". There can be a maximum of two OCR iterations. Changing this property to "2" enables the second pass if the first pass skips over any characters.

Also note "OCR Iterations" does not have any configurable properties of its own. It is either enabled or disabled with no further configuration necessary.

You can verify if a second pass was run using the "OCR Testing Tab".

Select a page from the Test Batch and press the "OCR Page" button. This will perform OCR on only the selected page to test out the OCR profile.

After you press the "OCR Page" button, a new tab will appear underneath the image, the "Diagnostics" tab. This tab has several images related to the OCR operation. If Iterative Processing was successful, you will see an "IP Image", which is the first full text OCR iteration, and a "Second Iteration", which is the image used for the second pass, with all the previously recognized characters digitally dropped out.

A second OCR pass is only done if portions of the image are not assigned a text character. If all characters are recognized in the first pass, the second pass will just be given a blank image, and the second pass won't run. In these cases, even with "OCR Iterations" enabled, you will not see a "Second Iteration" diagnostic image.
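The control flow described above, including the case where the second pass is skipped, might look like the following sketch. The image is modeled as a dictionary of positions to characters, and `toy_engine` is a deliberately artificial stand-in that ignores non-letters whenever letters are present. Neither it nor `iterative_ocr` reflects Grooper's real API.

```python
def toy_engine(image):
    """Hypothetical OCR stand-in.

    When letters are present it reads only the letters (simulating
    characters being ignored on a crowded image); otherwise it reads
    whatever is left.
    """
    letters = {pos: ch for pos, ch in image.items() if ch.isalpha()}
    return letters if letters else dict(image)


def iterative_ocr(image, ocr_iterations=2, engine=toy_engine):
    """Run one or two OCR passes and merge the results."""
    if ocr_iterations not in (1, 2):
        raise ValueError("OCR Iterations must be 1 or 2")

    # First pass: full page OCR.
    results = dict(engine(image))
    diagnostics = {"passes": 1}

    if ocr_iterations == 2:
        # Digitally drop out everything the first pass recognized.
        remaining = {pos: ch for pos, ch in image.items()
                     if pos not in results}
        # A blank image means there is nothing left for a second pass.
        if remaining:
            diagnostics["second_iteration"] = remaining
            diagnostics["passes"] = 2
            results.update(engine(remaining))

    return results, diagnostics


mixed = {(0, 0): "A", (0, 1): "B", (1, 0): "7"}
mixed_results, mixed_diag = iterative_ocr(mixed)

letters_only = {(0, 0): "A"}
clean_results, clean_diag = iterative_ocr(letters_only)
```

In the first call the digit is missed by the first pass and picked up by the second, so a "second_iteration" entry is recorded. In the second call everything is recognized on the first pass, so no second pass runs and no second-iteration entry exists, mirroring the missing "Second Iteration" diagnostic image.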
