OCR Synthesis

From Grooper Wiki

The Synthesis functionality is Grooper's unique method of pre-processing and re-processing raw results from the OCR engine to get better results out of it.

Using Synthesis, portions of the document can be OCR'd independently from the full text OCR. Portions of the image dropped out from the first OCR pass can be re-run through the OCR engine. And, certain results can be reprocessed. The results from the Synthesis operation are combined with (or in some cases replace) the full text OCR results from the OCR Engine into a single text flow.


About

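Synthesis is a collection of five separate OCR processing operations:

  • Font Pitch Detection
  • Bound Region Processing
  • Iterative Processing
  • Cell Validation
  • Segment Reprocessing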

As separate operations, the user can choose to enable all five operations, choose to use only one, or a combination. Synthesis is enabled on OCR Profiles, using the Synthesis property. This property is enabled by default on OCR Profiles (and can be disabled if you so choose). However, each Synthesis operation needs to be configured independently in order to function.

Ocr synthesis 1.png

Ocr synthesis 2.png

The general idea behind each of these operations is to increase the accuracy of OCR results by narrowing the OCR Engine's "field of vision". In general, the less the OCR Engine has to look at, the better the results will be. Rather than expecting the OCR Engine to get highly specific character accuracy by looking at the whole image, each operation breaks the image up in some way, allowing the OCR Engine to focus on only a portion of it. The accuracy for that portion is then increased and the results are "synthesized" into a final, more accurate, result.

Font Pitch Detection

Font Pitch Detection is the most fundamental of the Synthesis options. It exists because of a common OCR problem involving fonts. On the one hand, fonts are great. They're visually appealing. There are innumerable fonts available for web and print purposes. Businesses use them to brand themselves, imprinting their identity on their printed documents. Besides "looking good", documents use different fonts for a variety of reasons. Different fonts can denote different sections of a document. Often, different fonts are used for field labels and field values. They are a visual cue that one piece of text is distinct from another.

The downside is this variability in font usage from business to business, document to document, and line to line within a document, simply put, makes OCR more difficult. Even if the OCR engine correctly recognizes individual characters, for certain types of fonts, it can fail to recognize the spaces between those characters. Spaces exist between words for a reason! Without them, reading a simple sentence would be exceptionally difficult. Where does one word begin and another end?

Fixed Width Detection 01.png

This can present problems for regular expression pattern matching when you expect to see spaces between words or other character segments, ultimately impacting separation, classification or data extraction using Data Type extractors.

When this happens, it is typically because the font is variable width or variable pitched. There are two types of fonts:

  • Fixed Pitch - Each character is the same horizontal width. As a character itself, spaces are also the same width as any character in the font's alphabet.
  • Variable Pitch - Characters can be different horizontal widths. For example, "i"s and "l"s tend to be very skinny, while "M"s and "W"s tend to be wider. The spaces in these fonts are an average width of the characters in the set (or simply an assigned width).

Font Pitch Detection attempts to determine if a font is a fixed or variable pitch font based on the width of a sequence of characters on a line. If you have three or more characters in a row and they're all the same width, Grooper will assume it's a fixed pitch font. If you have three or more characters in a row and they have different widths, Grooper will assume it's a variable pitch font.

If Font Pitch Detection does detect a text segment is a variable pitch font, it will re-analyze the spaces between words and insert spaces if it sees them.

Fixed Width Detection 02.png

Fixed Width Detection 03.png

Configuring Font Pitch Detection

With Synthesis enabled, Font Pitch Detection is automatically enabled.

The Font Pitch property controls Font Pitch Detection. By default, it is set to Auto, which will analyze the character widths inside each text segment to determine if it is a fixed or variable pitched font.

This property can also be set to Variable to treat all text as variable pitched fonts and Fixed to treat all text as fixed pitched fonts.

Fixed Width Detection 04.png

With Font Pitch set to Auto, the Maximum Variance property controls the variance in character widths allowed for fixed pitched fonts.

By default, this property is 12%, meaning a character can be 12% wider or narrower than the average character width in the text segment and it still counts as a fixed pitch font. This accounts for uncontrollable variation of the pixel widths of fixed pitch characters resulting from scanning.

Depending on the image quality and fonts used, adjusting this property can help properly insert spaces unrecognized by the OCR engine.
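
The Maximum Variance check can be pictured with a short Python sketch. This is only an illustration of the rule as described above, not Grooper's actual implementation. The function name `classify_pitch` and the sample widths are hypothetical.

```python
from statistics import mean

def classify_pitch(char_widths, max_variance=0.12):
    """Return "fixed" or "variable" for one text segment.

    char_widths: pixel widths of the characters in the segment.
    max_variance: allowed deviation from the average width (0.12 = 12%).
    """
    if len(char_widths) < 3:
        # The rule described above needs three or more characters in a row.
        return "unknown"
    average = mean(char_widths)
    # Fixed pitch if every character is within the tolerance of the average.
    within = all(abs(w - average) <= max_variance * average for w in char_widths)
    return "fixed" if within else "variable"

# Hypothetical widths for a fixed pitch font with some scanner noise.
scanned_line = [20, 21, 19, 23, 18, 20]
print(classify_pitch(scanned_line, 0.12))  # variable
print(classify_pitch(scanned_line, 0.20))  # fixed
```

With these sample widths, one character is about 14% wider than the average, so the segment only reads as fixed pitch once the allowed variance is raised.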

Fixed Width Detection 05.png

Using the default of 12%, the top two lines were detected as variable pitch.

  • Variable pitched fonts are highlighted in green.
  • Fixed pitched fonts are highlighted in blue.

Fixed Width Detection 06.png

Bumping up the Maximum Variance to 20%, all lines are detected as fixed pitch. A higher variance in the widths of these characters allowed for all segments to read as fixed pitched fonts.

Fixed Width Detection 07.png

With all these Synthesis settings, the "OCR Testing" tab of an OCR Profile can be very helpful.

  1. Navigate to the "OCR Testing" tab.
  2. Press the "OCR Page" button.
  3. Navigate to the "Diagnostics" tab of the Results Viewer.
  4. Select the "Execution Log" in the Diagnostics Panel. This will give you a text readout of the OCR operation.
    • Here, we can see Font Pitch Detection added spaces from the line "Synthesis added 5 control characters and spaces."

Fixed Width Detection 08.png

Bound Region Processing

Bound Region Processing performs OCR on bound regions independently, such as text inside cells in a table.

Bound Region Processing performs OCR independently on text inside regions fully enclosed within lines. In other words, it processes text inside a box separately from the full page OCR. This vastly improves the OCR results for text inside tables or a complex line structure. By limiting OCR to just what is inside the box, the rest of the content on the page is not competing for the OCR Engine's attention, ultimately improving the result.

It does change how OCR runs quite a bit. Bound Region Processing actually runs before full page OCR. The order of operations is as follows:

1) Bound Region Detection - First, boxes are identified on the page.

  • Box size can be configured using Bound Region Processing's properties. There are also options to merge boxes of the same height and to ignore boxes that span across the entire width of the page. Since each box is OCR'd independently, this can reduce the number of total OCR operations, which will reduce the time it takes for Bound Region Processing to run.
  • Bound Region Detection works from the original image, not an IP image (if created using the OCR Profile's "IP Profile" property). So, it will ignore any Line Removal command applied during the temporary image pre-processing.

2) Bound Region OCR - After bound regions are identified, text within each bound region is OCR'd.

  • Each region is OCR'd independently. If there are ten boxes, there will be ten OCR operations, one for each box.

3) Bound Region Dropout - Since the contents of these regions have been OCR'd, these pixels are removed from the image used for full page OCR. Grooper already has the text from these regions, so there is no need to OCR them again.

  • Bound Region Processing is a one-two punch of OCR accuracy. Not only does it improve the accuracy of text inside bound regions, it can also increase the accuracy of text outside bound regions. Just like the rest of the image can interfere with the accuracy of OCR'd text inside the boxes, the boxes and the text inside them can interfere with OCR'ing the other text on the page. Dropping the bound region can give a bonus accuracy boost to the rest of the document.

4) Full Page OCR - The OCR Engine then runs on the resulting image, grabbing the rest of the text from the image.

5) Synthesis - Finally, the two results (the results inside bound regions + the results outside the bound regions) are merged together into a single text flow.
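
Steps 2 through 4 above can be sketched in Python with a toy page. This is a simplified illustration under stated assumptions, not how Grooper is implemented. The page is a grid of single characters standing in for pixels, the boxes from step 1 are given, and `fake_ocr`, `crop`, `drop_out` and `bound_region_pass` are hypothetical names.

```python
def fake_ocr(grid):
    """Stand-in for a real OCR engine: reads every non-blank cell, row by row."""
    return "".join(ch for row in grid for ch in row if ch != " ")

def crop(grid, box):
    """Cut the (left, top, right, bottom) box out of the grid."""
    left, top, right, bottom = box
    return [row[left:right] for row in grid[top:bottom]]

def drop_out(grid, boxes):
    """Return a copy of the grid with everything inside the boxes blanked."""
    result = [list(row) for row in grid]
    for left, top, right, bottom in boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                result[y][x] = " "
    return result

def bound_region_pass(grid, boxes, ocr=fake_ocr):
    region_text = [ocr(crop(grid, box)) for box in boxes]  # step 2: one OCR run per box
    remaining = drop_out(grid, boxes)                      # step 3: drop the boxes out
    page_text = ocr(remaining)                             # step 4: full page OCR
    return region_text, page_text                          # both feed step 5

page = [list("TOTAL 42"), list("NOTE  ok")]
boxes = [(6, 0, 8, 1)]  # a box around "42", assumed to come from step 1
region_text, page_text = bound_region_pass(page, boxes)
print(region_text)  # ['42']
print(page_text)    # TOTALNOTEok
```

The full page pass never sees "42", because it was blanked after being read on its own.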

Configuring Bound Region Processing

With Synthesis enabled, Bound Region Processing is a configurable property on OCR Profiles.

Selecting an OCR Profile, navigate to the "Bound Region Processing" property and change it from "Disabled" to "Enabled".


Bound region 2.png

For the majority of cases, Bound Region Processing will successfully detect bound regions using the default properties. You can verify this using the "OCR Testing Tab".


Bound region 3.png


Select a page from the Test Batch and press the "OCR Page" button. This will perform OCR on only the selected page to test out the OCR profile.


Bound region 4.png


After you press the "OCR Page" button, a new tab will appear underneath the image, the "Diagnostics" tab. This tab has several images related to the OCR operation. If Bound Region Processing was successful, you will see a "Bounded Regions" image. Select that image. All bounded regions will be highlighted in green and outlined in blue.


Bound region 5.png


Also, FYI, the "Main OCR Input" image shows the original image with the text OCR'd from bound region processing dropped out. This is what will be handed to the OCR Engine for full page OCR. (Furthermore, if we ran a temporary IP Profile on this OCR Profile, we could easily get rid of those table lines as well, further increasing the efficacy of the OCR operation.)


Bound region 6.png

Bound Region Processing has several properties you can configure if necessary. You can reveal its properties by double clicking "Bound Region Processing" in the OCR Profile, or pressing the caret button to the left of the property.


Bound region 7.png


Properties affecting box size and detection

Property | Default Value | Information
Minimum Size | 6pt | This setting controls the minimum width or height of a box. With the default, the smallest box that passes this check is 6 pt wide by 1 pt high, or 1 pt wide by 6 pt high. That is a fairly small box. If Bound Region Processing is detecting bound regions that aren't boxes, you may find it useful to increase this value.
Minimum Area | 12pt | This setting controls the minimum area of a box. It works in combination with the "Minimum Size" property to control which boxes are detected. So, even though the Minimum Size default is 6 pt, Bound Region Processing won't actually detect a 6 pt wide by 1 pt high box, because its area is only 6 pt (6 x 1 = 6 and 6 < 12). Like Minimum Size, this property helps narrow down which bound regions are included in the Bound Region Processing operation.
Maximum Width Ratio | 75% | This property controls the maximum width of a single box relative to the width of the whole page. At 75%, a box will not be detected if it is wider than three quarters of the page width. If you want to detect boxes of any width, even if they span the full width of the page, set this property to 100%.
Maximum Height | 1in | Here, you can limit the maximum height of a box. If you wish to detect boxes of any height, change this property to "0". This property also interacts with the "Always Allow Landscape" property. See below for important information on how they interact.
Always Allow Landscape | True | By default, boxes that are wider than they are high (having a "landscape" instead of "portrait" orientation) are exempt from exclusion even if they are higher than the "Maximum Height" value. Only boxes that are narrower than they are high ("portrait" oriented boxes) will be excluded from Bound Region Processing. If you want the "Maximum Height" value to exclude landscape boxes as well, set this property to "False".
Maximum Count | 0 | With this property set to "0" there will be no limit to the number of boxes detected. If you do enter a maximum count value, bound region detection will stop once it finds one less than the maximum value (i.e. if you enter a Maximum Count of "10" and there are 11 boxes on the page, only 9 bound regions will be detected).
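
One way to read how the size properties combine is as a single filter, sketched below in Python. This is an interpretation of the descriptions above, not Grooper's code. The function `keep_box` is hypothetical, sizes are in points (1 in = 72 pt), "Minimum Size" is assumed to apply to the larger of width and height, and "Maximum Count" is left out.

```python
def keep_box(width, height, page_width,
             min_size=6.0, min_area=12.0, max_width_ratio=0.75,
             max_height=72.0, always_allow_landscape=True):
    """Decide whether a detected box (sizes in points) survives the filters."""
    if max(width, height) < min_size:
        return False  # Minimum Size (assumed to test the longer side)
    if width * height < min_area:
        return False  # Minimum Area
    if width > max_width_ratio * page_width:
        return False  # Maximum Width Ratio
    if max_height > 0 and height > max_height:
        # Maximum Height, with the landscape exemption ("0" means no limit)
        if not (always_allow_landscape and width > height):
            return False
    return True

page_width = 612.0  # a letter-size page, 8.5 in wide
print(keep_box(6, 1, page_width))      # False: passes Minimum Size, fails Minimum Area
print(keep_box(200, 100, page_width))  # True: too tall, but landscape is exempt
print(keep_box(50, 100, page_width))   # False: too tall and portrait
```

Setting `always_allow_landscape=False` makes the 200 by 100 box fail as well, which matches the advice in the "Always Allow Landscape" row.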

The Merge Regions property

The Merge Regions property does not have to do with how regions are detected, but instead how those regions are processed. When enabled, it will merge adjacent boxes next to each other on a horizontal line as long as they are the same height. Furthermore, they must themselves meet a height requirement in order to be merged, set by the "Maximum Merge Height" property.

This can speed up the time it takes Bound Region Processing to run by lowering the number of total OCR operations. However, this does have the potential of negatively impacting the accuracy of the results in each cell. Whether or not you choose to use this property will mostly depend on if you need to value the speed of the OCR operation over its accuracy. This property is enabled by default. You will need to disable it in order to see if it impacts the accuracy of Bound Region Processing.

Bound region 8.png
Property | Default Value | Information
Merge Regions | True | This setting controls whether or not adjacent boxes of the same size are merged together.
Maximum Merge Height | 14pt | Adjacent boxes of the same height smaller than this value will be merged together. If you wish to ignore a maximum merge height, merging all boxes of the same height on the same line regardless of size, enter "0" here.
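
The merge rule can be sketched as follows. This is a minimal Python illustration of the behavior described above, not Grooper's implementation. The function `merge_regions` and the `gap_tolerance` parameter (how close two boxes must be to count as adjacent) are assumptions made for the example.

```python
def merge_regions(boxes, max_merge_height=14.0, gap_tolerance=1.0):
    """Merge horizontally adjacent boxes of equal height.

    boxes: (left, top, right, bottom) tuples in points.
    gap_tolerance: hypothetical slack for deciding that two boxes touch.
    """
    merged = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):  # top to bottom, left to right
        left, top, right, bottom = box
        height = bottom - top
        if merged:
            p_left, p_top, p_right, p_bottom = merged[-1]
            same_row = p_top == top and p_bottom == bottom
            touching = abs(left - p_right) <= gap_tolerance
            # "0" disables the height limit, as described in the table above.
            short_enough = max_merge_height == 0 or height < max_merge_height
            if same_row and touching and short_enough:
                merged[-1] = (p_left, p_top, right, p_bottom)
                continue
        merged.append(box)
    return merged

table = [(0, 0, 50, 12), (50, 0, 120, 12), (120, 0, 200, 12),  # a 12 pt high row
         (0, 12, 50, 40), (50, 12, 120, 40)]                   # a 28 pt high row
print(len(merge_regions(table)))                      # 3
print(len(merge_regions(table, max_merge_height=0)))  # 2
```

Five boxes would mean five OCR operations. With the defaults the short row collapses into one region, leaving three.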

Iterative Processing

Iterative Processing improves the OCR operation by performing a second pass at OCR. After the OCR Engine performs full page OCR, characters recognized from the first pass are digitally dropped out. Then, a second OCR pass is run on the resulting image. This way characters that were ignored from the first pass can be isolated and recognized separately. And the results are merged with the OCR results from the first pass.

First, OCR runs on the full page. Recognized characters are digitally removed. A second OCR pass runs on the remaining portion of the image.
Iterative ocr 1.png Iterative ocr 2.png
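
The two-pass idea can be sketched in Python with a stand-in engine. This is an illustration only, not Grooper's implementation. The page is modeled as a dictionary of glyphs, and `fake_engine` is a hypothetical engine that only reads the largest text it is shown, which mimics characters being ignored on the first pass.

```python
def fake_engine(glyphs):
    """Stand-in OCR engine: only reads the largest glyphs it is shown."""
    if not glyphs:
        return {}
    biggest = max(size for _, size in glyphs.values())
    return {pos: ch for pos, (ch, size) in glyphs.items() if size == biggest}

def iterative_ocr(glyphs, engine=fake_engine, iterations=2):
    """glyphs maps a position to a (character, font_size) pair."""
    remaining = dict(glyphs)
    text = {}
    for _ in range(iterations):
        if not remaining:
            break                 # nothing left on the image, so no further pass runs
        found = engine(remaining)
        text.update(found)        # merge with the results of earlier passes
        for pos in found:         # digitally drop out what was recognized
            del remaining[pos]
    return "".join(text[pos] for pos in sorted(text))

page = {0: ("H", 24), 1: ("I", 24), 2: ("o", 8), 3: ("k", 8)}
print(iterative_ocr(page, iterations=1))  # HI
print(iterative_ocr(page, iterations=2))  # HIok
```

The small text is only picked up once the large text has been removed from the image handed to the engine.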

Configuring Iterative Processing

With Synthesis enabled, "OCR Iterations" is a configurable property on OCR Profiles.

Selecting an OCR Profile, navigate to the "OCR Iterations" property and change it from "1" to "2". There can be a maximum of two OCR iterations. Changing this property to "2" enables the second pass if the first pass skips over any characters.

Also note "OCR Iterations" does not have any configurable properties of its own. It is either enabled or disabled with no further configuration necessary.


Iterative ocr 3.png

You can verify if a second pass was run using the "OCR Testing Tab".


Iterative ocr 4.png


Select a page from the Test Batch and press the "OCR Page" button. This will perform OCR on only the selected page to test out the OCR profile.


Iterative ocr 5.png


After you press the "OCR Page" button, a new tab will appear underneath the image, the "Diagnostics" tab. This tab has several images related to the OCR operation. If Iterative Processing was successful, you will see an "IP Image", which is the first full text OCR iteration, and a "Second Iteration", which is the image used for the second pass, with all the previously recognized characters digitally dropped out.


Iterative ocr 6.png


Iterative ocr 7.png


! A second OCR pass is only done if portions of the image are not assigned a text character. If all characters are recognized in the first pass, the second pass will just be given a blank image, and the second pass won't run. In these cases, even with "OCR Iterations" enabled, you will not see a "Second Iteration" diagnostic image.

More information about the operation can also be viewed in the "Execution Log" such as the time the second iteration took to run and how many characters were merged with the full page OCR.


Iterative ocr 8.png

Cell Validation

Cell validation breaks up the page into a matrix of rows and columns and verifies the result of full page OCR by OCR'ing each cell independently and merging the results with the full page results. This can improve OCR results by segmenting out documents with structures atypical from normal left to right paragraph style text flows. For example, documents with columns of text greatly benefit from cell validation. It allows each column to be OCR'd independently, greatly improving the results.

For a document like this, the page can be divided into a matrix of 1 row by 3 columns.
Cell val 1.png
After full page OCR runs, OCR is performed on each cell.
Cell val 2.png
Cell val 3.png
Cell val 4.png
The results from each cell are merged with the full page results, forming a more complete OCR result.
Cel val 5.png

Configuring Cell Validation

With Synthesis enabled, Cell Validation is a configurable property on OCR Profiles.

Selecting an OCR Profile, navigate to the "Enable Cell Validation" property and change it from "False" to "True". You will see the Cell Validation properties appear automatically when you do.


Cell val 6.png

Define Rows and Columns

The main thing you need to do when enabling Cell Validation is to define how many rows and columns you want to divide the page into. The default is 2 rows and 2 columns, but this will likely change depending on how your document is structured. The example above was a document with three columns of text. So the "Rows" property was set to "1" and the "Columns" property was set to "3".

Property | Default Value | Information
Rows | 2 | Defines the number of rows to break up the image into. This will be the number of cells in each column.
Columns | 2 | Defines the number of columns to break up the image into. This will be the number of cells in each row.

Define the Overlap and Buffer Zone

By default there is a slight amount of overlap for each cell. The reason for this is twofold. First, even if the document structure doesn't change from document to document, the cell's position may be slightly off from one to another. This overlap will still allow the cell to be captured if it isn't precisely in the same spot every time. It gives the cells a little "wiggle room" to account for inconsistencies in scanning, printing or formatting.

Second, if a character falls on the edge of a cell's boundary, only a portion of that character may fall in the cell. We would not want to OCR these portions of the character because they would give us inaccurate results. Instead, Cell Validation sets a "Buffer Zone" around the edge of the cell and will remove characters from the OCR operation if they fall in this zone. Overlapping the cells slightly will allow these characters dropped out from the Buffer Zone to be captured by another cell.

Property | Default Value | Information
Cell Edge Buffer | 0.1 | The size, in inches, of the buffer zone around the cell's border. Any character falling in this zone will not be OCR'd, preventing partial characters from being mis-recognized.
Cell Overlap | 0.25 | The size, in inches, each cell overlaps the other. Note, this can be set to "0" to disallow cell overlap. However, this may result in characters dropped out from the Buffer Zone being excluded from the Cell Validation operation.
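
The interplay of rows, columns, overlap and buffer can be sketched in Python. This is a geometric illustration under assumptions, not Grooper's implementation. It assumes each cell grows by half the overlap on every side (clamped to the page), and that a character counts as "in the buffer zone" when its box comes within the buffer distance of any cell edge, page edges included. The names `cell_rectangles` and `in_buffer_zone` are hypothetical.

```python
def cell_rectangles(page_width, page_height, rows=2, columns=2, overlap=0.25):
    """Return (left, top, right, bottom) rectangles in inches, row by row."""
    cell_w = page_width / columns
    cell_h = page_height / rows
    half = overlap / 2  # assumption: each neighbor contributes half the overlap
    cells = []
    for r in range(rows):
        for c in range(columns):
            left = max(0.0, c * cell_w - half)
            top = max(0.0, r * cell_h - half)
            right = min(page_width, (c + 1) * cell_w + half)
            bottom = min(page_height, (r + 1) * cell_h + half)
            cells.append((left, top, right, bottom))
    return cells

def in_buffer_zone(char_box, cell, buffer=0.1):
    """True if a character box comes within `buffer` inches of the cell's edge."""
    c_left, c_top, c_right, c_bottom = char_box
    left, top, right, bottom = cell
    return (c_left < left + buffer or c_right > right - buffer or
            c_top < top + buffer or c_bottom > bottom - buffer)

# A hypothetical 9 x 11 inch page split into 1 row by 3 columns.
cells = cell_rectangles(9.0, 11.0, rows=1, columns=3)
char = (3.05, 1.0, 3.12, 1.1)  # a character sitting on the first column boundary
print(in_buffer_zone(char, cells[0]))  # True: skipped by the first cell
print(in_buffer_zone(char, cells[1]))  # False: captured by the second cell
```

Because the second cell starts a little to the left of the boundary, the character skipped by the first cell still gets OCR'd.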

Optionally Skip the First Column

The "Skip First Column" property allows you to ignore cell validation for the first column and use only the full page OCR results. This is done for the purposes of shaving time off the OCR operation. One of the areas where cell validation excels is when graphical elements exist to the left of the text (or what would be considered the first column). Performing a secondary OCR pass on graphics will be a waste of time. They are not text and will not produce usable text data. In these situations you can save time by turning this property from "False" to "True".

Property | Default Value | Information
Skip First Column | False | When set to "True", Cell Validation will skip all cells in the first column of the rows-by-columns matrix.

You can verify cell validation was performed using the "OCR Testing Tab".

Select a page from the Test Batch and press the "OCR Page" button. This will perform OCR on only the selected page to test out the OCR profile.

After you press the "OCR Page" button, a new tab will appear underneath the image, the "Diagnostics" tab. This tab has several images related to the OCR operation. With Cell Validation enabled, you will see an "IP Image", which is the first full text OCR iteration, and several "Cell" images, which are the images used to OCR each cell. For this example, we split up the page into one row and three columns. So, there are three "Cell" images, one for each column of the text.


Cell val 7.png


Cell val 8.png


Cell val 9.png


Segment Reprocessing

Part of what the OCR Engine does is break up text on a document into segments: first segmenting out lines from the full page, then words from those lines, and finally individual characters from those words. Each character is given a certain confidence score based on how well the image character matches the text character the OCR Engine assigns it. However, we've seen OCR Engines generally perform better if you limit what they have to look at. There's a lot of competing information OCR has to deal with when analyzing a full document and breaking it out into segments. What if we could target whole words, phrases or lines that the OCR Engine recognized with poor overall accuracy and re-perform OCR just on those portions without the rest of the document getting in the way?

With Segment Reprocessing, we can! Segment Reprocessing first identifies segments by locating word and line segments the OCR engine recognized that are separated by large amounts of space.


Grooper highlights the identified segments of text recognized by the OCR Engine. Segments identified using a fixed width font are highlighted in blue. Segments using variable width fonts are highlighted in green.
Segment reprocess 1.png


Grooper then looks at the individual confidence scores of each character in the segment and calculates an average confidence score for the segment. The idea behind this is the overall confidence of a segment of characters is actually more important than the individual confidence of the characters. The difference between the words "Grooper" and "Groooer" is just one character off, but fundamentally changes the meaning of the word. If the "p" character was recognized with a poor confidence (Say 50%), and every other character at a high confidence (Say 99%), it would drop the average confidence of characters in that segment down (to 92%). With Segment Reprocessing, we can set a confidence threshold for the segment's average character confidence. If the segment's confidence falls below that threshold, Grooper will hand the OCR engine just the snippet of the image where that segment is located and reprocess it. See below for an even more dramatic example.

At first pass the OCR engine did a particularly bad job recognizing the word "Manufacturing".
The original image The OCR results (not so good)
Segment reprocess 2.png Segment reprocess 3.png


Mousing over the "Manufacturing" segment, we can see the average confidence is pretty bad, 69%. If we set the Segment Reprocessing Threshold higher than 69%, the OCR Engine runs again just focusing on the portion of the image containing that segment, giving us a much better result.
The original image The OCR results (much better)
Segment reprocess 4.png Segment reprocess 6.png


! Note the above example is an extreme example. You will likely want to use a fairly high reprocessing threshold, often above 90%. While that may seem high, for larger segments of text, if only one or two characters are dropping the overall confidence, it may score well above 90%. However, those one or two characters may drastically impact the text data. If you want to target these segments for reprocessing, your threshold will need to be fairly high.
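
The averaging from the "Groooer" example can be checked with a few lines of Python. This is a sketch of the arithmetic only. The function name `needs_reprocessing` is hypothetical, and Grooper may weight or round confidences differently.

```python
def needs_reprocessing(char_confidences, threshold):
    """Return the segment's average confidence and whether it falls below the threshold."""
    average = sum(char_confidences) / len(char_confidences)
    return average, average < threshold

# "Grooper" read as "Groooer": six characters at 99%, the bad "p" at 50%.
confidences = [99, 99, 99, 99, 50, 99, 99]

average, reprocess = needs_reprocessing(confidences, threshold=95)
print(average)    # 92.0
print(reprocess)  # True

print(needs_reprocessing(confidences, threshold=90)[1])  # False
```

A single badly recognized character still leaves the segment at 92%, so a threshold of 90% would not catch it. This is why a fairly high threshold is suggested.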


Configuring Segment Reprocessing

With Synthesis enabled, Segment Reprocessing is a configurable property on OCR Profiles.

Selecting an OCR Profile, navigate to the "Segment Reprocessing Threshold" property. By default, this property is set to "0%". This will disable segment reprocessing. To enable it, enter the average character confidence percentage threshold for segments you wish to reprocess. Any text segments with an average character confidence falling below this threshold will be OCR'd a second time, using only the portion of the image where that segment falls instead of the full image.


Segment reprocess 7.png


Segment reprocess 8.png

You can verify segments were reprocessed using the "OCR Testing Tab".

Select a page from the Test Batch and press the "OCR Page" button. This will perform OCR on only the selected page to test out the OCR profile.

After you press the "OCR Page" button, a new tab will appear underneath the image, the "Diagnostics" tab. This tab has several images related to the OCR operation. Navigate to the "Execution Log". This will tell you how many segments were reprocessed, if any of them were repaired (meaning their original OCR result was changed), and the result of that change.


Segment reprocess 9.png


In this case, 8 segments fell below the 90% confidence threshold, and 3 of them were changed upon reprocessing. Note, this means not every segment is repaired through this process. Sometimes, the OCR Engine takes a second look and comes up with the same result as it did the first time.

Besides setting the Reprocessing Threshold, Segment Reprocessing has only one configurable property, the Segment End Ratio. This property controls how wide a gap between segments must be to establish the end of a segment. This is measured relative to the font size of the segment that comes before the gap.

The general idea is any gap larger than two space characters should constitute the end of a segment. This will capture full lines of text in a normal paragraph text flow as a single segment. This will capture most form labels as a segment, as they normally are separated from each other on the page with a fair amount of space (usually more than two space characters). Same with items in a table. There's usually a good amount of padding between items in individual cells, so these items would be isolated as discrete segments.

However, there can be inconsistencies based on how the OCR engine determined the font's size, or individual character widths for variable fonts that result in Grooper seeing a space as larger or smaller than normal. This can result in over or under-segmenting the document. For example, below, the space between "Grooper" and "Industries" was determined to be larger than the normal gap, using the default 100%.

Segment reprocess 10.png

If we were to set the Segment End Ratio property to 150%, it would widen that gap between segments, resulting in each line being separated out as a single segment. If the gap were normally "two" spaces it would widen it to 150% of normal, or "three" spaces.

Segment reprocess 11.png

And on the other end of the equation, if we set the Segment End Ratio property to 50%, it would shorten the gap between segments, resulting in each word being separated out as a single segment. If the gap were normally "two" spaces it would shorten it to 50% of normal, or "one" space.

Segment reprocess 12.png

The Segment End Ratio property can help you fine tune your segment length depending on your needs. This property can be set anywhere from 0% to 400%.
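
The effect of the Segment End Ratio can be sketched in Python. This is an illustration of the "two space characters, scaled by the ratio" description above, not Grooper's implementation. The function `split_segments`, the word coordinates and the `space_width` value are all hypothetical.

```python
def split_segments(words, space_width, segment_end_ratio=1.0):
    """Group the words of one line into segments.

    words: (text, left, right) tuples, sorted left to right.
    """
    # Assumption: the normal segment gap is two space widths, scaled by the ratio.
    max_gap = 2 * space_width * segment_end_ratio
    segments = []
    current = []
    previous_right = None
    for text, left, right in words:
        if current and left - previous_right > max_gap:
            segments.append(" ".join(current))  # the gap is wide enough to end the segment
            current = []
        current.append(text)
        previous_right = right
    if current:
        segments.append(" ".join(current))
    return segments

# Hypothetical positions: a slightly wide gap (22) and a very wide gap (120).
line = [("Grooper", 0, 70), ("Industries", 92, 180), ("Invoice", 300, 360)]
print(split_segments(line, space_width=10, segment_end_ratio=1.0))
# ['Grooper', 'Industries', 'Invoice']
print(split_segments(line, space_width=10, segment_end_ratio=1.5))
# ['Grooper Industries', 'Invoice']
```

Raising the ratio to 150% lets the slightly wide gap between "Grooper" and "Industries" stay inside one segment, while the much wider gap still ends it.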

The Exception to the Rule

There is one character that fundamentally changes how segmenting works, the colon. Once Grooper identifies a normal segment gap after a colon character ("two" space characters), it will always start a new segment, regardless of how much you increase the Segment End Ratio.

Segment End Ratio = 100% Segment End Ratio = 400%
Segment reprocess 13.png Segment reprocess 14.png

The reason behind this is how colons are used on documents. Typically, they themselves are used to break information out into meaningful segments. On the left side of the colon is some kind of label identifier. On the right followed by a gap is a piece of information that label pertains to. These labels are very important and the pieces of information they relate to are very important. Making sure both can be reprocessed independently tends to yield better OCR results and therefore data extraction down the line.
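
The colon exception can be expressed as a small rule. The Python sketch below is a reading of the behavior described above, not Grooper's code. The function `gap_ends_segment` and its parameters are hypothetical, and the normal gap is again assumed to be two space widths.

```python
def gap_ends_segment(previous_word, gap, space_width, segment_end_ratio=1.0):
    """Decide whether the gap after previous_word ends the current segment."""
    normal_gap = 2 * space_width
    if previous_word.endswith(":"):
        # After a colon, the Segment End Ratio is ignored.
        return gap > normal_gap
    return gap > normal_gap * segment_end_ratio

# With the ratio at 400%, a 25 unit gap only ends a segment after a colon.
print(gap_ends_segment("Name:", 25, space_width=10, segment_end_ratio=4.0))  # True
print(gap_ends_segment("Name", 25, space_width=10, segment_end_ratio=4.0))   # False
```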