IP Profile (Node Type)

From Grooper Wiki
This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.

IP Profiles are a step-by-step list of image processing operations (IP Commands). They are used for several image processing related operations, but primarily for:

  1. Permanently enhancing an image during the Image Processing activity (usually to get rid of defects in a scanned image, such as skewing or borders).
  2. Cleaning up an image in-memory during the Recognize activity without altering the image to improve OCR accuracy.
  3. Computer vision operations that collect layout data (table line locations, OMR checkboxes, barcode value and more) utilized in data extraction.

These operations generally fall into three categories:

  1. Archival Adjustments - These are permanent adjustments to the exported document's image.
    • Permanent image adjustments are performed when an IP Profile is executed during the Image Processing activity.
  2. OCR Cleanup - Image cleanup can dramatically improve OCR results.
    • However, they can also drastically alter the document's image. Image adjustments are temporarily applied to a document prior to OCR when an IP Profile is executed during the Recognize activity. This is useful for non-destructive image clean up to improve OCR results, keeping the document's pages as their original image to preserve their archival images upon export.
  3. Layout Data Collection - This includes visual information used for data extraction purposes (such as table line locations, barcode information, OMR checkbox states) as well as image features used for Visual classification.
    • Layout Data can be collected either during the Image Processing or the Recognize activities.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

Permanent vs. Temporary Image Processing

The Image Processing activity permanently alters a document's image by applying an IP Profile. However, it is possible to temporarily clean up document images and revert back to the original document image. This is done during the Recognize activity.

For example, you may have a document where table lines are getting in the way of accurate OCR. However, if you remove these lines during the Image Processing activity, they will be permanently removed, making it difficult to review the documents in Data Review and changing the archival image stored later to something that no longer looks like the original document.

Instead, you can use an OCR Profile which references an IP Profile that has a Line Removal step during Recognize. The image will be temporarily changed according to the IP Profile. Then, OCR will run on the altered image. Last, the image will revert back to its original form, retaining the OCR results from the pre-processed image as well as the original image.

  • Furthermore, any image based data targeted by the IP Profile (such as the table line locations for this example) will still be saved to the Batch Page for later use.
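The temporary workflow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Grooper's implementation: the page dictionary, the `remove_horizontal_lines` cleanup, and the `fake_ocr` stand-in are all hypothetical names invented for this example.

```python
import copy
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]  # rows of grayscale pixels, 0 = black, 255 = white


def remove_horizontal_lines(image: Image) -> Tuple[Image, List[int]]:
    """Hypothetical cleanup: blank out fully black rows and report their indexes."""
    cleaned = copy.deepcopy(image)  # work on a copy, never on the original
    line_rows = []
    for y, row in enumerate(cleaned):
        if all(px == 0 for px in row):
            line_rows.append(y)
            cleaned[y] = [255] * len(row)
    return cleaned, line_rows


def fake_ocr(image: Image) -> str:
    """Stand-in for an OCR engine: reports how many black pixels remain."""
    black = sum(px == 0 for row in image for px in row)
    return f"{black} black pixels"


def recognize(page: Dict, ocr: Callable[[Image], str]) -> None:
    """Clean a copy, run OCR on the copy, and keep only the results."""
    cleaned, line_rows = remove_horizontal_lines(page["image"])
    page["text"] = ocr(cleaned)
    page["layout_data"] = {"horizontal_lines": line_rows}
    # page["image"] is never reassigned, so the original image is retained.


page = {"image": [[255, 0, 255], [0, 0, 0], [255, 255, 255]]}
original = copy.deepcopy(page["image"])
recognize(page, fake_ocr)

print(page["text"])               # 1 black pixels
print(page["layout_data"])        # {'horizontal_lines': [1]}
print(page["image"] == original)  # True
```

The point of the sketch is the ordering: the cleaned copy exists only long enough for OCR to read it, while the OCR text and the line locations are stored alongside the untouched original.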

For more information on both permanent and temporary image processing as a concept, visit the Image Processing (Concept) article.

Anatomy of the IP Tester Tab

IP Profiles are tested using the Tester tab, which is available when you select an IP Profile, IP Group or IP Step. Understanding how to navigate this interface will make it easier to test and configure individual IP Steps, groups of steps in an IP Group and whole IP Profiles themselves.

After making an IP Profile, you will add the IP Steps to the node tree under the IP Profile. You can test the full IP Profile from either the IP Profile itself or on the IP Step.

  1. Click on the IP Step in the node tree.
  2. Click on the "Tester" tab.
  3. On this screen you can make edits to the IP Step properties and test the IP Profile.

IMPORTANT!!!

By default, anytime you test on an IP Step, you will test for the whole IP Profile. There are rare situations where you might want to see what an IP Step does in isolation outside of the rest of the IP Profile.

  1. To switch between testing the whole IP Profile and testing the IP Step in isolation, click this filter button.

IP Steps in the profile are listed, selected, and added in the node tree under the IP Profile object. IP Profiles are a sequential list of IP Steps, each one performing an image processing operation called an IP Command. This IP Profile is very simple. It only has one IP Step, using the Auto Deskew IP Command.
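Conceptually, a sequential list of steps behaves like a simple pipeline, where each step receives the output of the step before it. The sketch below illustrates that idea in Python. The `invert` and `flip_vertical` steps and the `run_profile` function are hypothetical stand-ins, not Grooper IP Commands or Grooper code.

```python
from typing import Callable, List

Image = List[List[int]]  # rows of grayscale pixels, 0 = black, 255 = white
Step = Callable[[Image], Image]


def invert(image: Image) -> Image:
    """Hypothetical step: flip every pixel value."""
    return [[255 - px for px in row] for row in image]


def flip_vertical(image: Image) -> Image:
    """Hypothetical step: reverse the order of the rows."""
    return image[::-1]


def run_profile(steps: List[Step], image: Image) -> Image:
    """Apply each step in order; each step sees the previous step's output."""
    for step in steps:
        image = step(image)
    return image


page = [[0, 255], [255, 255]]
result = run_profile([invert, flip_vertical], page)
print(result)  # [[0, 0], [255, 0]]
```

Because each step works on the result of the previous one, the order of the steps matters, which is why the node tree lets you move steps up and down.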

IP Steps are added to the list by right-clicking on the IP Profile object, mousing-over "Add Command", then selecting the category and IP Step. Changing the order of the steps is the same process of moving objects around in Grooper: either click and drag the object to another location, or hold "Ctrl" and press the up or down key on your keyboard.

Here, you can select a Test Batch to help you configure your IP Profile. All alterations to the documents in the Test Batch are done in memory when configuring an IP Profile. They will retain their original form unless the IP Profile is applied using the Image Processing activity.

Each IP Command has its own set of configurable properties. Here, you can adjust them as needed to fit the demands of your document set.

  1. Here you can change the Command properties.
  2. Here is where you will configure the properties for the specific IP Step.

The Diagnostics Panel is extremely helpful when configuring IP Command properties and verifying that steps are processing a document as intended. It contains a number of images for each IP Step showing how its IP Command alters the image, including a before "Input Image" and an after "Output Image".

Last but not least is the Document Viewer. This allows you to view the document selected in the Batch Selector. This window will also show you the selected image in the Diagnostics Panel.

How To

Create a new IP Profile

Before you create an IP Profile, you will likely want a Test Batch to verify its results. Be sure to create a Test Batch before creating an IP Profile.

Add a New IP Profile

IP Profiles may be created and stored in a Content Model's Local Resources folder.

  1. Under the Local Resources folder, we have created a folder called "IP Profiles". Right click on the folder.
  2. Mouse over "Add" and click "IP Profile..."

  1. Name the IP Profile whatever you like. Here we just named it "IP Profile Example".
  2. Click "EXECUTE" to create it.

  1. This will create a blank IP Profile in the IP Profiles folder.
  2. If we go to the "Tester" tab, we see there is not a lot going on here yet.
  3. At the bottom we have "Batch Viewer".
  4. To the right we can see the selected document in the "Document Viewer".
  5. To select a Batch, click on the Batch selector icon at the top right of the "Batch Viewer".

Before going any further, the first thing you will want to do is select a Test Batch.

  1. When the window pops up, navigate through the folders until you find and select the Batch you want.
    • Here we have selected the "Wiki - IP Profiles [Batch]" Batch.
  2. Click "OK" to select the Batch.

Add IP Steps

IP Steps are the individual elements of an IP Profile. The IP Profile will execute each step, one after the other, altering the image according to whatever IP Command the IP Step uses.

  1. To add a new step, right-click on the IP Profile.
  2. Mouse-over the "Add Command" selection. From here you can access all the different IP Steps.
  3. We are going to add an "Auto Deskew" step, which is found under the "Image Transforms" category.
    • This will accomplish two things. One, it will straighten up the image, making the final output document more visually appealing (an Archival Adjustment). Two, we will get better OCR results (an OCR Cleanup). All the text will be nicely aligned line by line, instead of potentially jumbled on different lines.

  1. When the "Add Command" window pops up, all you need to do is click "EXECUTE" to add the step.

  1. Now you should see the chosen IP Step in the node tree under the IP Profile.

Verify the Results

  1. To verify the results of the IP Profile, first select the IP Profile.
  2. Click on the "Tester" tab if not already there.
  3. Click the "Play" icon at the top of the "PROPERTIES" panel.

In the "Document Viewer" window, you can see the image was de-skewed by the single IP Step in our IP Profile. Now it's nice and straight, which is both visually appealing (an Archival Adjustment) and gives us better OCR results (an OCR Cleanup). Notice a couple of other things happened after we hit that "Execute" button as well.

  1. You will see a set of diagnostic images and an "Execution Log" text file for each IP Step in the IP Profile, seen in the highlighted "Diagnostics Panel" window. These images and logs can be useful for configuring the step's properties.
    • There will also be a before "Input Image" and after "Output Image" both for each individual step as well as the entire IP Profile.


Pressing the "Execute" button does not actually modify the selected document.

All alterations to the documents in the Test Batch are done in memory when configuring an IP Profile. Furthermore, any time you navigate to another document, the IP Profile will "execute" on the temp batch. In other words, there's no need to press the Execute button unless you want to verify the results of the IP Profile on the currently selected document.

FYI

If you do want to permanently apply the IP Profile to a page, you can do so in an ad hoc manner by selecting a page and pressing the "Save Processed Page" button. This will permanently apply the IP Profile's steps to the page in the Test Batch.

FYI

While not discussed explicitly in the previous tutorial, you may also add "IP Groups" to an IP Profile.

IP Groups allow a hierarchy of IP Steps to be executed in an IP Profile. You can add IP Steps to an IP Group just as you do for an IP Profile. You can even add IP Groups to IP Groups for even further nested levels of image processing hierarchy.

IP Groups are used to:

  1. Organize complex IP Profiles.
  2. Create re-usable units of image processing which can be copied and pasted into other IP Profiles.
  3. Conditionally execute or skip a sequence of IP Steps.
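The hierarchy and conditional execution described above can be pictured as a small tree walk. The following Python sketch is an illustration only: the `Step` and `Group` classes, the `condition` field, and the `is_wide` test are hypothetical and do not reflect how Grooper stores or evaluates IP Groups.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Union

Image = List[List[int]]  # rows of grayscale pixels, 0 = black, 255 = white


@dataclass
class Step:
    name: str
    command: Callable[[Image], Image]


@dataclass
class Group:
    name: str
    children: List[Union[Step, "Group"]] = field(default_factory=list)
    # Hypothetical condition: when set, the group runs only if it returns True.
    condition: Optional[Callable[[Image], bool]] = None


def execute(node: Union[Step, Group], image: Image, log: List[str]) -> Image:
    """Run a step, or walk a group's children in order (groups may nest)."""
    if isinstance(node, Step):
        log.append(node.name)
        return node.command(image)
    if node.condition is not None and not node.condition(image):
        return image  # condition failed: skip every step in this group
    for child in node.children:
        image = execute(child, image, log)
    return image


def invert(image: Image) -> Image:
    return [[255 - px for px in row] for row in image]


def is_wide(image: Image) -> bool:
    return len(image[0]) > 2


profile = Group("Profile", [
    Step("Invert", invert),
    Group("Wide pages only", [Step("Invert again", invert)], condition=is_wide),
])

narrow_log: List[str] = []
narrow_result = execute(profile, [[0, 255]], narrow_log)

wide_log: List[str] = []
wide_result = execute(profile, [[0, 255, 0]], wide_log)

print(narrow_log)  # ['Invert']
print(wide_log)    # ['Invert', 'Invert again']
```

Treating the profile itself as the outermost group keeps the walk uniform: a group is just a list of steps and other groups, optionally guarded by a condition.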

Examples

Sample Configuration of an IP Profile for Permanent Image Processing

Permanent IP makes alterations to the archival version of the exported document. As permanent image processing is, after all, permanent, you must be careful about which commands you use. Most often, IP Profiles for Permanent IP are fairly small, performing only moderate adjustments to the image. More drastic alterations that improve OCR but dramatically alter the image are left for Temporary IP Profiles.

Before anything, you need a sample document set of the kinds of documents targeted by your IP Profile. Once you have these documents ready in a Test Batch, it's a good idea to evaluate them, getting an idea about the kinds of issues the document set has. Below are the documents we will be looking at and some of the issues involved with them.

One of the most common issues resolved by a permanently applied IP Profile is a skewed image. Not only will you improve the image visually for human readers, but deskewing an image is also critically important for OCR to run correctly. This document does not look too bad, but the background isn't totally white. Some of the paper's texture came through when the document was scanned. A lot of scanners have built-in image processing functions that will clean this up, but this functionality is somewhat basic and can cause problems down the road. Grooper's IP capabilities are much more robust, and there may be cases where not cleaning up the image actually produces a better result (more on that later).


This image is upside down. As a rule of thumb, if a document is hard for a human to read, it'll be hard for an OCR Reader to recognize the document's characters. There's not too much to fix on this document, but keep in mind an IP Profile will be applied to all documents. Part of the IP Profile creation will be to make sure what positively impacts one document doesn't negatively impact another.


Border cropping is very common when permanently applying an IP Profile. Black borders around images are typically a result of scanning the image and not actually part of the document. Border Crop or Border Fill can get rid of that border, giving an image that is more representative of the actual document. All those black pixels around the edge of the document are also most definitely getting in the way of OCR. However, you might have to ask yourself the question: Is the border part of the original document? If so should we keep the border for the archival export of the document? Take a look at the transcript on the left. At first glance, this border looks like it's a border from scanning the image, but it may actually be part of the document.

The first step we will add is an "Auto Border Crop" command. This command will only crop an image if it detects a border in an established zone. So, only those documents with borders should be affected.

  1. Right click on the IP Profile to add a step.
  2. Mouse-over "Add Command".
  3. Mouse-over "Border Cleanup" and then click on "Auto Border Crop".

  1. Click "EXECUTE".

For our document, the basic settings did not alter the image. There is no "Output Image" in the Diagnostics Panel. You can also see clearly there is still a black border on the image in the Document Viewer.

The border is considerably larger than normal. Changing the "Border Region Size" to 0.75 in on all four sides will properly crop this document. The diagnostic images are useful when configuring any IP Step's properties. In the case of Auto Border Crop, you'll want to use the "Zoning" image to configure the zone where the border falls.

The Border Region zone is shown by the thin red rectangle. It encapsulates some, but not all of the border (from the edge of the document to the edge of the Border Region) using the default of 0.25 in.

Let's go ahead and change the Border Region Size.

  1. Click on the Auto Border Crop IP Step in the node tree.
  2. Change the value of the Border Region Size to 0.75in.

By increasing the Border Region Size to 0.75 in you can see the zone now entirely encapsulates the border.
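When judging whether a region size will cover a border, it can help to translate the physical size into pixels. The helper below is a rough sketch under two assumptions that are not stated in this article: the page was scanned at 300 DPI, and a point is the standard typographic point (1/72 of an inch). The `to_pixels` function is a hypothetical name for illustration.

```python
def to_pixels(size: str, dpi: int = 300) -> float:
    """Convert a size such as '0.75in' or '15pt' to pixels at the given DPI.

    Assumes 1 pt = 1/72 in. The 300 DPI default is only an example value.
    """
    text = size.strip().lower()
    if text.endswith("in"):
        return float(text[:-2]) * dpi
    if text.endswith("pt"):
        return float(text[:-2]) * dpi / 72
    raise ValueError(f"Unsupported unit in {size!r}")


print(to_pixels("0.25in"))  # 75.0
print(to_pixels("0.75in"))  # 225.0
print(to_pixels("15pt"))    # 62.5
```

Under these assumptions, the default 0.25 in zone reaches 75 pixels in from each edge, while 0.75 in reaches 225 pixels, which is why the larger setting can cover a thicker scanning border.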

Now, you can see the step removes (most of) the border.

The remaining part of the border can be removed using the "Border Fill Settings" or by adding a "Border Fill" step.

Next, we need to take care of the upside down image. This can be done with an "Auto Orient" step. This step will automatically detect the orientation of the image based on the text on the page. If the page is upright, Auto Orient will detect the text reads like normal and do nothing. If the command detects the text is upside down, it will re-orient the page, turning it right-side up.

  1. Right click on the IP Profile.
  2. Mouse-over "Add Command".
  3. Mouse-over the "Image Transforms" category and click on "Auto Orient..."

  1. Navigate to the IP Profile on the node tree.
  2. Click on the "Tester" tab.
  3. Click on the play icon to test the IP Profile.

  1. Upon execution, the upside down page has been righted.

Next we need to take care of the skewed text. For this, we will use the Auto Deskew command.

  1. Right click on the IP Profile in the node tree.
  2. Mouse-over "Add Command"
  3. Mouse-over the "Image Transforms" category and click on "Auto Deskew..."

  1. When the "Add Command" window pops up, click "EXECUTE" to create the step.

  1. When you go back to the IP Profile, click on the play button to run the profile on the document.
  2. This will resolve issues around skewed documents.

Further Decision Making

The previous three IP Commands are very commonly found in Permanent IP Profiles. Since they determine automatically if a page is skewed, has a border, or is oriented incorrectly, there is significantly less risk that these commands will negatively impact the rest of the document set (less risk, but not none!). For any other IP Commands, you have one major question to ask.

How much do you want to alter the original image?

Anything you do with a Permanent IP Profile will affect the final document you send out during export. Keep in mind, there are ways to temporarily apply an IP Profile to improve OCR Results later.

When making your choices on what other IP Commands to add to a Permanent IP Profile you should focus on altering the document for human readability without negatively impacting machine readability (i.e. OCR results).


For example, the "Brightness Contrast" Color Adjustment command can be very helpful to increase both the human and machine readability for some documents. It modifies the brightness or contrast of an image (or both).

Increasing the Brightness property to "10" and the Contrast property to "40" cleans up the first page of this document nicely. The background has been removed while preserving the text.


However, be sure to verify it does not negatively impact other pages.

On the second page, this command also starts to remove the table lines on the page. Grooper can use the positions of these lines for certain data extraction methods, such as Table Extraction. However, if you get rid of them with a Permanent IP Profile, you won't be able to find those lines later.


Furthermore, the IP Profile runs on all documents in the batch. Be sure it isn't adversely affecting other documents.

This page is fairly light to begin with. While the typewritten text is still there, the handwritten text is starting to fade. OCR Engines may not do well recognizing handwritten text, but if you ever want a human to look at this document and be able to read the handwritten text, increasing the contrast is not well suited for this document.

For situations like this, you may need to find a happy middle ground.

The "Contrast Stretch" command is often used to help improve the image quality of documents. It works to normalize the image's contrast. It adjusts the contrast so that the lightest pixels are turned pure white and the darkest are turned pure black.

It doesn't do quite as good of a job as the "Brightness Contrast" command but it does brighten up the whites and darken the black parts of the image a bit.
And it does so without losing the handwritten text on this document.
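The general idea of stretching contrast can be shown with a simple linear rescale. The sketch below is a textbook min/max stretch, assuming 8-bit grayscale pixels. It is not necessarily the algorithm the "Contrast Stretch" command uses, and `contrast_stretch` is a hypothetical function name.

```python
from typing import List

Image = List[List[int]]  # rows of 8-bit grayscale pixels, 0 = black, 255 = white


def contrast_stretch(image: Image) -> Image:
    """Linearly rescale so the darkest pixel becomes 0 and the lightest 255."""
    darkest = min(min(row) for row in image)
    lightest = max(max(row) for row in image)
    if lightest == darkest:
        return [row[:] for row in image]  # flat image: nothing to stretch
    span = lightest - darkest
    return [[round((px - darkest) * 255 / span) for px in row] for row in image]


page = [[50, 100], [150, 200]]
stretched = contrast_stretch(page)
print(stretched)  # [[0, 85], [170, 255]]
```

Because every pixel is moved proportionally, faint marks such as light handwriting get darker along with everything else rather than being pushed to white, which matches the gentler behavior described above.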

Just as you need to think about how one command will adversely affect other documents in the set, you can take advantage of documents that are very different from others in the set.

This document is the only color document and has some significant issues with how it was scanned. Likely it's actually a picture taken of a screen. Portions of the image are light and portions are dark. Its white balance is all off. Since all the other documents are black and white or grayscale, we can reliably use an "Auto White Balance" command to make it look a little better.

FYI

There are some ways to leverage image based information to create conditional logic around what steps to execute in an IP Profile. Visit the Conditional IP section of this article for more information.

Make Adjustments

It's very rare to make an IP Profile and have everything work perfectly without doing some unit testing and adjusting some properties on a step or two. Take our "Auto Border Crop" step and our two documents with borders. As we configured the step, these are the results for our two transcripts.



Depending on what route we want to take, there's still some cleaning up on these borders we can do. The easiest thing to do at this point would be to add a "Border Fill" command to clean up the black edges on the left side of these documents.

  1. Right-click on the IP Profile in the node tree.
  2. Mouse-over "Add Command".
  3. Mouse-over the "Border Cleanup" category and then click on "Border Fill..."

  1. When the "Add Command" window pops up, click "EXECUTE" to create the IP Step.

  1. One of the main adjustments you will make when using a Border Fill command is setting the "Method" property. By default, it is set to "Exclusive". This means anything fully outside the border zone (seen in the image below in red) will be filled with the selected "Fill" color.

  1. If we go back to the IP Profile, test the IP Profile, and take a look at the zoning.jpg section in the Diagnostics log...
  2. ... we'll see the red line that indicates the border zone. Note the red zone intersects the black border for this document.

If we set the method to "Inclusive" it will include borders that overlap the border zone, dropping them out.

  1. Go to the Border Fill IP Step in the node tree to change the method.
  2. Click the arrow next to the Command property to access the subproperties.
  3. Click the hamburger icon to the right of the Method property to access the drop down menu.
  4. Click Inclusive to set the property. Don't forget to save before going back to the IP Profile to test.

  1. Going back to the IP Profile and testing, we can see we now have an "Output Image".
  2. We see that the outer border has been removed.

But what if we start looking at these documents and don't actually want to remove the border for the transcript on the left but do want to remove the border for the right?


Keep this border Remove this border


For this document set, there's a trick we can do using the properties of "Auto Border Crop" and "Border Fill" to do this. First we will configure "Auto Border Crop" to crop the blue transcript and not the brown one. Looking at the brown transcript, that border is actually part of the document. As such, there is a sliver of white pixels around the document. We can use the "Maximum Border Weight" property to only drop out perfectly solid borders.

  1. Navigate to the "Auto Border Crop" IP Step in the node tree.
  2. After accessing the Command subproperties, we're going to change the "Maximum Border Weight" from 90% to 100%.

  1. Going back to the IP Profile and testing, we can see that we no longer have an "Output Image" under the "Auto Border Crop" diagnostics.
  2. This means the IP Step did not alter the image, so we have the original border.

  1. If we test the IP Profile on the blue transcript, we do receive an "Output Image" for the "Auto Border Crop".
  2. Since the blue transcript does have a solid black border around the image, it does crop the image. Or at least it mostly does. There is a slight border on the left and right sides still, but we'll fix that next.
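The reasoning behind this trick can be illustrated with a toy calculation. The sketch below assumes a border "weight" is simply the fraction of black pixels in an edge strip and that cropping requires the weight to meet a cutoff. That is an assumption made for illustration, not a description of how Grooper's property is actually evaluated, and `border_weight`, `should_crop` and `required_weight` are hypothetical names.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels, 0 = black, 255 = white


def border_weight(strip: Image) -> float:
    """Fraction of pixels in an edge strip that are black (assumed definition)."""
    pixels = [px for row in strip for px in row]
    return sum(px == 0 for px in pixels) / len(pixels)


def should_crop(strip: Image, required_weight: float) -> bool:
    """Crop only when the strip is at least as solid as the required weight."""
    return border_weight(strip) >= required_weight


# A perfectly solid scanning border (like the blue transcript's).
solid = [[0] * 10 for _ in range(2)]
# A border with a sliver of white at the page edge (like the brown transcript's).
with_sliver = [[255] + [0] * 9 for _ in range(2)]

print(border_weight(with_sliver))     # 0.9
print(should_crop(with_sliver, 0.9))  # True
print(should_crop(with_sliver, 1.0))  # False
print(should_crop(solid, 1.0))        # True
```

With a 90% cutoff both borders qualify, but at 100% only the perfectly solid one does, which mirrors the different outcomes for the two transcripts.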

  1. Go back to the "Border Fill" IP Step in the node tree.
  2. We are going to adjust the "Border Region Size" to "15pt". We can also set the "Method" back to "Exclusive" now.

  1. Testing the IP Profile now will show an "Output Image" in the diagnostics.
  2. The area outside of the border is now filled in with a white color.

  1. Testing on the brown transcript does not show an "Output Image" in the diagnostics.
  2. For the brown transcript, the border intersects the border zone. So, using the "Exclusive" method keeps the border from being dropped out.

We are left with two originally bordered images, one whose border was removed and one whose border was not.


Granted, this really only worked because of how these documents came into Grooper. We got lucky in that there was a slight amount of white pixels on each edge of the brown transcript, making the "border" not perfectly solid. That being said, a lot of how you configure Grooper's properties to target certain documents and not others is based on analyzing certain aspects of the documents. We wouldn't have even known to try this approach if we hadn't noticed that the border on the brown transcript wasn't a true border.

Sample Configuration of an IP Profile for Temporary Image Processing

For temporary image processing, we don't need to be concerned with how this image will look upon export. We only need to concern ourselves with cleaning up the image to improve OCR results.

The general plan is (as much as possible) get rid of anything on the page that is not text. This way non-text artifacts on the page will not interfere with the OCR Engine recognizing actual text characters. If these pixels simply are not present, the OCR engine won't have to figure out if they are part of a line, word, or character when segmenting the image. Similarly, if they aren't part of a character, once the image is segmented all the way down to an individual character, they won't confuse the OCR engine when it comes time to recognizing what text character that character segment should be.

Before anything, you need a sample document set of the kinds of documents targeted by your IP Profile. Once you have these documents ready in a Test Batch, it's a good idea to evaluate them, getting an idea about the kinds of issues the document set has. Below are the documents we will be looking at and some of the issues involved with them.

This document is full of interference for OCR. Table lines, check boxes, and partially shaded headers can all cause problems for accurate OCR results. Furthermore, we likely will want to use the line and box positions later during data extraction. A Temporary IP Profile can be configured to remove these elements but store their locations in memory for later use.

This document too is filled with table lines, and it has "negative regions" where portions of white text are on a black background. OCR must be able to read black pixels. So, we will switch the white text to black during image processing. It will also need to be turned into true black and white instead of grayscale.

This document may seem simple, but again, each command runs on each document in the document set. We will need to make sure the IP Profile works for each document in the set.

Last, this document will give us a few things to consider. It has a large border that is part of the document. It has a logo and other artifacts that could be removed. Besides these larger non-text artifacts, it will also end up having small specks that could interfere with OCR.

OCR absolutely must work with a black and white image. While OCR engines will turn an image black and white on their own, they don't always do a great job at it. Furthermore, you have no control over how the OCR engine turns the image black and white. Grooper's image processing capabilities allow for greater configuration of how an image is turned into a black and white image before handing it to the OCR engine. The vast majority of temporary IP Profiles will contain a "Threshold" or "Binarize" step to convert color and grayscale images into true black and white.

Knowing this, let's use a "Threshold" command as our starting point.

  1. Right Click on the IP Profile.
  2. Mouse-over "Add Command".
  3. Mouse-over the "Format Conversion" category and then click on "Threshold".

  1. When the "Add Command" window pops up, click "EXECUTE" to create the IP Step.

This document was a grayscale image previously, and now has been turned black and white using the Auto thresholding method.

  1. In the "Diagnostics" panel, we can see that we have an output image for the "Threshold" IP Step.
  2. As you can see, the gray background behind some of the text (such as the portion highlighted here) has been turned white, leaving us a totally black and white image.

We are going to keep the default settings for this step. For more information about thresholding methods, visit the Binarize article.

FYI

The only difference between the "Threshold" IP Command and the "Binarize" IP Command is what bit depth (or color depth) format the altered image takes.

  • "Threshold" will convert the image into a true 1-Bit black and white image. The pixels can be either black or white and nothing else (This bit depth allows for 2¹ colors, or 2 colors, white and black).
  • "Binarize" actually converts the image to an 8-Bit Grayscale image. This means the pixels can be black, white or multiple shades of gray in between (This bit depth allows for 2⁸ colors, or 256 colors, white, black and 254 shades of gray). However, only the white and black channels are used. Functionally, this gives you a black and white image in an 8-bit format.

For most operations, these two IP Commands are interchangeable. For example, if an OCR engine is handed an image processed by the "Binarize" command, it's still black and white even if it's in the grayscale format. The results will be no different than if it were handed an image processed by the "Threshold" command.

However, if a certain IP Command requires a bit depth larger than single bit black and white (such as some of the "Filter" command's options), the "Binarize" IP Command allows the IP Profile to hand the next step an 8-Bit Grayscale image.
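The difference in output format can be seen in a small Python sketch. It uses a fixed cutoff of 128 purely as an assumption for illustration; the IP Commands themselves offer thresholding methods such as Auto that are not reproduced here, and the function names are hypothetical.

```python
from typing import List

Image = List[List[int]]


def threshold_1bit(image: Image, cutoff: int = 128) -> Image:
    """1-bit result: every pixel is 0 (black) or 1 (white)."""
    return [[1 if px >= cutoff else 0 for px in row] for row in image]


def binarize_8bit(image: Image, cutoff: int = 128) -> Image:
    """8-bit result that only ever uses the values 0 (black) and 255 (white)."""
    return [[255 if px >= cutoff else 0 for px in row] for row in image]


gray = [[10, 200], [130, 90]]
one_bit = threshold_1bit(gray)
eight_bit = binarize_8bit(gray)

print(one_bit)    # [[0, 1], [1, 0]]
print(eight_bit)  # [[0, 255], [255, 0]]
```

Both results describe the same black and white pattern. Only the value range differs, which is what matters to a later step that expects 8-bit input.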

Moving onto the next document, we can see this document was indeed turned black and white, but there's another problem we have to deal with.

  1. This document has labels such as "Loan Terms" (here, highlighted) and "Projected Payments" that are white text on a black background. These are what we will call "inverted labels". OCR engines expect text to be black pixels, not white. So, they aren't going to recognize the text in these inverted labels.

This is a very common problem. Grooper's "Negative Region Removal" command is designed to address this.

  1. Right-click on the IP Profile.
  2. Mouse-over "Add Command".
  3. Mouse-over the "Feature Removal" category and then click on "Negative Region Removal..."

  1. When the "Add Command" window pops up, click on "EXECUTE" to add the IP Step.

  1. Now if we test this IP Profile, we have an output image for the Negative Region Removal IP Step.
  2. Upon executing this step, the inverted label is now changed to black text on a white background, allowing for OCR to properly recognize the text.
    • Many OCR engines, such as Transym, have similar negative region inversion capabilities. However, these capabilities are "black boxed". At best, you can turn the operation off or on, but you will not be able to configure it beyond that. The "Negative Region Removal" command allows for greater configuration of the detection and removal of these regions. For example, you may notice the label is now outlined in a black border. You can actually remove that border by changing the Outline Thickness property from 1pt to 0pt.
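At the pixel level, turning an inverted label into black text on white amounts to flipping the values inside the region. The sketch below shows only that final inversion for a rectangle supplied by hand. How regions are detected is not shown, and `invert_region` is a hypothetical helper rather than the command's actual logic.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels, 0 = black, 255 = white


def invert_region(image: Image, top: int, left: int, bottom: int, right: int) -> Image:
    """Return a copy with pixels in rows [top, bottom) and columns [left, right) inverted."""
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = 255 - out[y][x]
    return out


page = [
    [255, 255, 255, 255],
    [0, 255, 0, 0],  # a black label row with one white "text" pixel
    [255, 255, 255, 255],
]
fixed = invert_region(page, top=1, left=0, bottom=2, right=4)
print(fixed[1])  # [255, 0, 255, 255]
```

After the inversion, the former white "text" pixel is black on a white background, which is the form an OCR engine expects.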

Recall the three major reasons for image processing in Grooper: (1) Archival Adjustments (2) OCR Cleanup, and (3) Layout Data Collection. This next step will focus on getting some layout data (3), with the added benefit of helping out our OCR a little bit (2). This will also be the first step that illustrates the importance of configuring an IP Command's properties to narrow down what you do and don't want to remove from a document.

  1. We will use a "Box Removal" command to both remove checkboxes (to improve OCR accuracy) and determine if they are checked or blank through OMR (for data collection down the road).

OMR stands for Optical Mark Recognition. OMR has been around for even longer than OCR. Remember back in school when you took a test and filled in bubbles on an answer sheet with a No. 2 pencil? Well, those answer sheets were graded by OMR! The answer sheet was fed into a scanner (probably a Scantron) that would detect if a bubble was filled in or not.

Grooper is doing something similar here. The main difference is that Grooper first has to find the box! The "Box Removal" command first detects boxes and saves their locations on the page to the LayoutData.json file attached to that page. Once it does, it checks whether there are any marks inside the box, that is, whether any pixels are filled within the boundaries of the box. If there are, it records that box as "checked" in the LayoutData.json file, or "unchecked" if the box is blank. Last, it removes the boxes from the page, clearing the way for better OCR results.
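
The checked/unchecked decision described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Grooper's implementation: the page is a toy grid of 0 (white) and 1 (black) pixels, and the `fill_ratio`, `omr_state` and `min_fill` names are hypothetical.

```python
def fill_ratio(page, box):
    """Fraction of black pixels strictly inside a box outline.

    page: list of rows, each a list of 0 (white) or 1 (black).
    box:  (left, top, right, bottom), inclusive coordinates of the outline.
    """
    left, top, right, bottom = box
    interior = [page[y][x]
                for y in range(top + 1, bottom)
                for x in range(left + 1, right)]
    return sum(interior) / len(interior) if interior else 0.0


def omr_state(page, box, min_fill=0.05):
    # min_fill is a made-up tolerance for this sketch, not a Grooper property.
    return "checked" if fill_ratio(page, box) >= min_fill else "unchecked"


def draw_box(page, box):
    """Draw a one-pixel box outline onto the page."""
    left, top, right, bottom = box
    for x in range(left, right + 1):
        page[top][x] = page[bottom][x] = 1
    for y in range(top, bottom + 1):
        page[y][left] = page[y][right] = 1


# Build a blank 20x10 page with two boxes; mark the second one with an "X".
page = [[0] * 20 for _ in range(10)]
empty_box = (1, 1, 7, 7)
marked_box = (11, 1, 17, 7)
draw_box(page, empty_box)
draw_box(page, marked_box)
for i in range(2, 7):
    page[i][10 + i] = 1   # one stroke of the X
    page[i][18 - i] = 1   # the other stroke

states = {"left": omr_state(page, empty_box), "right": omr_state(page, marked_box)}
print(states)
```

Only the pixels inside the outline are counted, so the box border itself never makes an empty box look checked.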

  1. Right-click on the IP Profile.
  2. Mouse-over "Add Command".
  3. Mouse-over the "Feature Removal" category and then click on "Box Removal..."

  1. When the "Add Command" window pops up, click "EXECUTE" to add the IP Step.

  1. For this example, we're going to take a look at the Box Removal IP Step itself.
  2. Click on the "Tester" tab of the IP Step.
  3. Click on the play icon at the top of the tab to test the IP Profile.

Based on the "Output Image" you can see the boxes on this document have been removed, such as the ones highlighted.

If you come across a situation where Grooper is not detecting and removing the boxes, you may want to see if editing the Minimum Size Range property would help.

  1. By default, the Minimum Size Range is set to 6pt.

If we were to set this property to 7pt, Grooper would not have detected the boxes on this Closing Disclosure document.

Remember that any adjustments you make affect the whole Batch and not just an individual document. If you make an adjustment to account for an issue on one document, check your other documents to make sure you are still getting the desired result.

When adding a "Box Removal" IP Step to the temporary IP Profile, Grooper automatically detects these boxes and adds them to the layout data. You do not need to add a separate "Box Detection" step to the IP Profile.

But what about this "Employee Termination Form"? The boxes are still there. "Box Removal" failed to detect any boxes. Why?

The boxes here are not very square at all. They are taller, skinnier rectangles.

  1. We can correct this by adjusting the Minimum Aspect Ratio property. By default it is set at 75%.
  1. When we lower the Minimum Aspect Ratio to 70%, we see that Grooper now detects and removes the boxes.
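
To see how the two properties discussed above interact, here is a rough Python sketch of a size and squareness filter. It assumes aspect ratio means the short side divided by the long side, which may not match how Grooper measures it, and the box dimensions are invented for illustration. `is_box_candidate` is a hypothetical helper.

```python
def is_box_candidate(width_pt, height_pt, min_size_pt=6.0, min_aspect=0.75):
    """Return True when a rectangle is big enough and square enough to count as a checkbox."""
    short_side = min(width_pt, height_pt)
    long_side = max(width_pt, height_pt)
    if short_side < min_size_pt:
        return False                      # too small to be a checkbox
    return short_side / long_side >= min_aspect


# A hypothetical 6.5pt square box: found at the 6pt default, missed at 7pt.
print(is_box_candidate(6.5, 6.5))                    # True
print(is_box_candidate(6.5, 6.5, min_size_pt=7.0))   # False

# A hypothetical tall box, 7pt wide by 9.7pt high (aspect ratio of about 72%).
print(is_box_candidate(7.0, 9.7))                    # False
print(is_box_candidate(7.0, 9.7, min_aspect=0.70))   # True
```

Loosening either property lets more shapes through, which is why it is worth re-testing the rest of the Batch after each change.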

One of the most common temporary image processing adjustments for OCR cleanup is the "Line Removal" command. Lines are present on most documents in one way or another. They are used to create and divide tables, sections or individual fields on a document. This is great for humans reading a document! They act as visual dividers of information. They are not so great for OCR. Simply removing lines, in most cases, will greatly improve your OCR results. We will add a "Line Removal" command, and look at some common configuration issues.

  1. Right-click on the IP Profile.
  2. Mouse-over "Add Command".
  3. Mouse-over the "Feature Removal" category and then click on "Line Removal..."

  1. When the "Add Command" window pops up, click "EXECUTE" to add the IP Step.

You may notice, "Line Removal" has a ton of configurable properties. That's partly because of how important removing lines is to get good OCR results. Rather than putting a black box on this operation, we want to give you a high degree of control when it comes to how lines are detected and how they are removed.

For the most part, these default settings work quite well. These defaults are configured to detect and remove most lines on most documents. The extra configurability is there should you need it. As you can see on our "Application for Cow Ownership" document, "Line Removal" removed all those lines without us lifting a finger.

However, check out Page 3 of this "Closing Disclosure" document. There are still lines there. They are faint, but we can see the default settings did not remove them.

Before we get too crazy fine-tuning our "Line Removal" command's configuration, sit back for a second and think about what the problem is. These lines are indeed a little lighter on the original image. When we thresholded the image in our very first step, using the Auto method, parts of these lines were translated as white pixels and parts as black, but not enough to make a solid line. What if we could use a different thresholding method just for "Line Removal" to make these lines come out better before detection?

We can! That's what the Binarization Settings are for on every IP Command in which they appear. In fact, every step we've added so far has this setting. This allows you to use different thresholding methods for different IP Commands. Perhaps for OCR'd text the Auto method works better to turn an image black and white, but, in this case, the Dynamic method is going to allow those faint lines to come through more clearly as solid black lines, allowing us to detect and remove them.

First we need to move the "Threshold" step from the first step down the list after our "Line Removal" step. Order of operations is very important to IP Profiles. Each step alters the image and hands that altered image to the next step. As we have it set up so far, this IP Profile is already handing Line Removal a black and white image. So, changing the 'Threshold Method' will do nothing. But if we wait to threshold the image until after "Line Removal", we can utilize a different thresholding method before we turn the whole image black and white for good with the "Threshold" step.

  1. Click and drag (or hold CTRL and press the down arrow) to move the "Threshold" step below the "Line Removal" step in the IP Profile.
  1. Next, select the "Line Removal" step in the IP Profile.
  2. Expand the Binarization Settings property.
  3. Select the Method property, clicking the hamburger icon to the right to access the dropdown.
  4. Using the dropdown menu, change this setting from Auto to Dynamic.
  1. Select the "Binarized" diagnostic image to see how the image was processed using Dynamic thresholding.
    • As you can see, this is probably not the black and white image we want to use for OCR. However, all these lines are nice solid black lines that can be easily detected by "Line Removal".
  1. Select the final "Output Image" diagnostic image to see the end result.
    • We have a much cleaner image with all these lines detected and removed, without adjusting any of the myriad "Line Removal" settings.
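
The order-of-operations point can be shown with a toy example. The sketch below uses two fixed cutoffs as stand-ins for the Auto and Dynamic methods; it does not reproduce Grooper's actual thresholding algorithms, and the pixel values are made up.

```python
def binarize(pixels, cutoff):
    """Map grayscale values (0-255) to pure black (0) or pure white (255)."""
    return [0 if value < cutoff else 255 for value in pixels]


# One row of grayscale pixels crossing a faint gray line (values 160-175).
row = [250, 245, 170, 160, 175, 248, 252]

strict = binarize(row, 128)    # stand-in for the profile's Threshold step
lenient = binarize(row, 200)   # stand-in for a method that keeps faint lines

# Threshold step first: the faint line has already turned white, so a later
# step's own binarization setting has nothing left to work with.
threshold_first = binarize(strict, 200)

print(strict)           # every pixel is white, the line is lost
print(lenient)          # three black pixels, the line survives
print(threshold_first)  # identical to strict
```

Once an image is pure black and white, re-binarizing it with a different method cannot bring back detail, which is why the "Threshold" step has to move below "Line Removal".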

Next, you may want to look for large artifacts you can remove, such as the black border on this image. The "Blob Removal" command is perfectly suited for this. We can tell the "Blob Removal" command to look for contiguous collections of pixels above a certain size.

  1. Right-click on the IP Profile.
  2. Mouse-over "Add Command".
  3. Mouse-over the "Feature Removal" category and then click on "Blob Removal..."

  1. When the "Add Command" window pops up, click "EXECUTE" to add the IP Step.

We will simply look for blobs that are wider than 5 inches. We can pretty much guarantee ourselves that a text character is going to be smaller than 5 inches wide on any document. Setting a Minimum Width of 5in will get rid of what we want, without getting rid of any text pixels for OCR.

  1. Expand the Width property and set the Minimum property to 5in
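
As a rough picture of what a width-based blob filter does, the Python sketch below groups touching black pixels into blobs and erases any blob at least a given width. It is an illustration under stated assumptions, not Grooper's algorithm: pixels are joined 4-way, the page is a tiny grid, and the 2 DPI resolution is invented so that 5in is only 10 pixels. `find_blobs` and `remove_wide_blobs` are hypothetical names.

```python
from collections import deque


def find_blobs(image):
    """Group black pixels (1) into 4-connected blobs; each blob is a list of (row, col)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, blob = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs


def remove_wide_blobs(image, min_width_in, dpi):
    """Return a copy of the image with every blob at least min_width_in inches wide erased."""
    min_width_px = min_width_in * dpi
    cleaned = [row[:] for row in image]
    for blob in find_blobs(image):
        xs = [x for _, x in blob]
        if max(xs) - min(xs) + 1 >= min_width_px:
            for y, x in blob:
                cleaned[y][x] = 0
    return cleaned


# Toy page: a border across the full width plus a small three-pixel "character".
page = [[0] * 12 for _ in range(5)]
page[0] = [1] * 12
page[2][3] = page[2][4] = 1
page[3][3] = 1

cleaned = remove_wide_blobs(page, min_width_in=5, dpi=2)
for row in cleaned:
    print(row)
```

The border is erased because its bounding box spans 12 pixels, while the small character is left untouched for OCR.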

As well as large artifacts getting in the way of good OCR, we might want to get rid of small specks as well. The trick is, we will want to remove as many pixels that are not text data, without removing specks that could be text data. After all, a period or a comma can look a lot like a random speck elsewhere on the page. The "Speck Removal" command will help us do this.

  1. Right-click on the IP Profile.
  2. Mouse-over "Add Command".
  3. Mouse-over the "Feature Removal" category and then click on "Speck Removal..."

  1. When the "Add Command" window pops up, click "EXECUTE" to add the IP Step.

  1. Change the Max Speck Size to 3px

If you would like, you can select the "Dropout Mask" from the Diagnostics to see all the tiny specks being removed from this image. Some of these specks are random noise, some of these specks are trivial text characters, but some of these are actual character data we want to preserve.

This is what the Quiet Zone Size property is for. It will create a buffer zone around larger pixel segments, such as text characters. If specks fall in this buffer zone, they will not be dropped out.

  1. Adjust the Quiet Zone Size property to 4pt, 2pt.
    • This will create a quiet zone of 4pt to the left and right of a character and 2pt above and below. You can also specify, left, right, top and bottom zones individually by expanding the Quiet Zone Size and adjusting them there.

These Quiet Zone settings are helpful for targeting small specks you do want to remove, while retaining text characters that would otherwise look like specks.
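
The quiet zone idea can be sketched with bounding boxes. In the Python example below, every measurement is in the same unit for simplicity, boxes are `(left, top, right, bottom)`, and a speck is kept when it falls inside the padded box of any larger segment. The function names and sample coordinates are hypothetical, and the real command works on pixels rather than on precomputed boxes.

```python
def in_quiet_zone(speck, anchor, pad_x, pad_y):
    """True when the speck's box overlaps the anchor's box grown by the padding."""
    sl, st, sr, sb = speck
    al, at, ar, ab = anchor
    return (sl <= ar + pad_x and sr >= al - pad_x
            and st <= ab + pad_y and sb >= at - pad_y)


def specks_to_drop(boxes, max_speck, pad_x, pad_y):
    """Return the boxes small enough to be specks that sit outside every quiet zone."""
    def is_speck(box):
        return (box[2] - box[0]) <= max_speck and (box[3] - box[1]) <= max_speck

    anchors = [b for b in boxes if not is_speck(b)]
    return [b for b in boxes
            if is_speck(b)
            and not any(in_quiet_zone(b, a, pad_x, pad_y) for a in anchors)]


letter = (100, 100, 106, 110)   # a text character
period = (108, 108, 109, 109)   # a period right after the character
noise = (300, 40, 301, 41)      # an isolated speck elsewhere on the page

dropped = specks_to_drop([letter, period, noise], max_speck=2, pad_x=4, pad_y=2)
print(dropped)
```

The period is the same size as the noise speck, but it survives because it sits within 4 units to the right of a character.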

There are other adjustments we could probably make to clean up these documents, but at some point you have to stop tinkering and put your IP Profile into action.

Temporary IP Profiles are executed as part of an OCR Profile. During the Recognize activity, the IP Profile's steps will be applied to a temporary copy of the document's image. OCR will then be performed on that temporary image according to the OCR Profile's settings.

  1. To set the temporary IP Profile select the IP Profile property on an OCR Profile.
  2. Using the dropdown menu, select your IP Profile from the Node Tree.
  1. To test the OCR Profile, go to the "Tester" tab.
  2. Select a page from the Batch and click on the play button to test the OCR Profile.
  3. Clicking the blue icon next to the play button will take you to the diagnostics, which will open in a new browser tab.

When using the "OCR Testing" tab, you can see the temporary image your IP Profile hands the OCR engine.

  1. When you are on the Diagnostics page, you can click on the "IP Image.jpg" to see what Grooper is feeding to the OCR Engine.
  2. This shows you what the page looks like after it has been run through the IP Profile.

Conditional IP

What happens when you have documents in your document set that just don't fit your IP Profile? Perhaps one configuration of a Box Removal command works for most of the documents, but there's one type of document that needs an entirely different configuration. What happens if most of the documents in your set perform well using the standard "Auto" thresholding method, but one works better using "Adaptive"?

In these situations, you may be able to use conditional logic via the "Should Execute" and "Next Step" expressions on IP Steps and IP Groups in an IP Profile. These expressions allow us to use snippets of .NET code to access information about the image or steps in the profile, and use them to determine if and when a step should run in an IP Profile.

Example: Should Execute Based on Classify Image

Different IP Steps, IP Groups or even entire IP Profiles can be executed based on the results of the "Classify Image" command. The "Classify Image" command compares an image against a set of sample images and classifies the image based on which sample it is most similar to. It does this by analyzing the color space of an image. For example, the RGB color space is made up of a red channel, a green channel, and a blue channel. The similarity would then be based on how similar the information in these three channels is to another image. For example, if a sample image has a high value in the red channel but a low value in the blue channel, it would not match an image that has a high blue channel but a low red channel.

For this example, we will create a conditional expression for thresholding these two documents.



The blue transcript is a good candidate for using "Adaptive" thresholding over the "Auto" method.

(Comparison images of the blue transcript: Auto and Adaptive)


However, the brown transcript is handled better by the "Auto" thresholding method.


(Comparison images of the brown transcript: Auto and Adaptive)


We will use the "Classify Image" command to have one image use "Auto" and the other "Adaptive".

Add IP Steps

This IP Profile will have three IP Steps: One "Classify Image" and two "Threshold" commands, one of which uses the "Adaptive" method and the other which uses the "Auto" method. Here, the two "Threshold" commands have been renamed accordingly.

Give Classify Image Sample Images

  1. Select the "Classify Image" IP Step in the node tree.
  2. We are going to copy the image from this blue transcript document to use as a sample image. Right-click on the page you want to copy (make sure you right click on the page object and not the folder).
  3. Click "Copy" to copy the image to your clipboard.
  4. Click the ellipsis button to the right of the Sample Images property.

  1. This will bring up the "Sample Images" window. Click the clipboard icon to paste in the image you copied.

  1. Give this sample a name using the "Sample Name" box.
  2. Click "OK" when finished.

  1. This will add the image to the list of Sample Images on the left. Click "OK"

Select the Color Space to Analyze

Next, select the color space you wish to use to classify the image. There are a variety of color space options, each of which measures different channels making up a document's color. For this example we are using the HSV color space, which measures hue, saturation, and value (pixel intensity).

Classify Image's Execution Log shows the measurements for the selected color space and how similar they are to the sample image. All color spaces will have "Channel 1", "Channel 2", "Channel 3" and "Entropy" listed under "Source Image Features". These are the image-based measurements of the selected image. The three channels correspond to the information in the three channels of the selected color space. For HSV, "Channel 1" is the hue, "Channel 2" is the saturation, and "Channel 3" is the value. "Entropy" is a measure of how "busy" the image is. The more black text on a document, the higher the entropy measure will be.
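
For a feel of what the three HSV channels capture, the short Python sketch below converts two made-up paper colors with the standard library's `colorsys` module. The RGB values are invented stand-ins for a bluish and a brownish page, and this says nothing about how Grooper computes its own channel features.

```python
import colorsys


def to_hsv(rgb):
    """Convert an (r, g, b) color with 0-255 channels to (hue, saturation, value) in 0-1."""
    r, g, b = rgb
    return colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)


bluish_paper = (150, 180, 230)    # hypothetical background color
brownish_paper = (200, 170, 130)  # hypothetical background color

blue_h, blue_s, blue_v = to_hsv(bluish_paper)
brown_h, brown_s, brown_v = to_hsv(brownish_paper)

print(f"blue : hue={blue_h:.3f} saturation={blue_s:.3f} value={blue_v:.3f}")
print(f"brown: hue={brown_h:.3f} saturation={brown_s:.3f} value={brown_v:.3f}")
```

The two colors are fairly close in saturation and value but far apart in hue, which is the kind of difference a hue channel makes easy to measure.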

The similarity score to each trained image is seen under "Results for Trained Image Classes". We only trained one image, the "Blue Transcript", which is coming in as "100%" similar.

The image is assigned a classification based on how similar the three channels and the entropy are to the trained images (assuming it meets the "Minimum Similarity" score). This image is classified as "Blue Transcript". This can be verified by its "Class Name".

We can look at the Execution Log for the brown transcript, and we see it came in at a lower similarity of 86.51%. However, if we look at the "Class Name", it's coming in as the Blue Transcript classification.

  1. With a similarity of 86.51%, the image is too similar to the Blue Transcript for it to be classified differently.
  2. By default, the Minimum Similarity property is set to 85%. Since 86.51% is higher than 85%, this brown transcript is being recognized as the "Blue Transcript" document.
  3. If we change the Minimum Similarity property to 90%, the brown transcript will no longer be recognized as the "Blue Transcript" document.

The Class Name for the brown transcript now shows as "None". Now that we have a benchmark that can tell one image from the other, we can use it to conditionally threshold the image.
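
The effect of the Minimum Similarity property can be modeled in a few lines of Python. This is a simplified sketch: the `classify` function is hypothetical, and it only mimics the pass/fail behavior described above using the scores from this example.

```python
def classify(similarities, min_similarity=85.0):
    """Return the best-matching sample name, or None when no score reaches the minimum."""
    if not similarities:
        return None
    name, score = max(similarities.items(), key=lambda item: item[1])
    return name if score >= min_similarity else None


blue_scores = {"Blue Transcript": 100.0}
brown_scores = {"Blue Transcript": 86.51}

print(classify(brown_scores))                       # Blue Transcript (86.51 >= 85)
print(classify(brown_scores, min_similarity=90.0))  # None
print(classify(blue_scores, min_similarity=90.0))   # Blue Transcript
```

Raising the minimum from 85 to 90 is enough to separate the two transcripts, because the brown one no longer clears the bar while the blue one still does.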

Set the Should Execute Expression

Next we need to figure out and apply our logic for thresholding these documents. If the image is classified as "Blue Transcript" we want to use the Adaptive method. Otherwise, we want to use the Auto method. The next step should be the Threshold (or Binarize) command using Adaptive thresholding.

  1. Select the "Threshold - Adaptive" IP Step in the node tree.
  2. Select the "Should Execute Expression" property and click the ellipsis button at the end.

  1. The expression we will write will reference the results of the "Classify Image" command. The basic idea here is: if the image was classified as "Blue Transcript", execute the command; otherwise, do not. The code expression we can use is as follows: Results.Classify_Image.ClassName = "Blue Transcript"
    • Note, the class name you enter must match the sample image's name exactly.
  2. Click "OK" when finished.

Success! The blue transcript was turned black and white using the Adaptive method. However, the real test will be if the brown transcript skipped the "Threshold - Adaptive" step and went straight to the "Threshold - Auto" step.

Although it may appear as if the "Threshold - Auto" step is skipped, all steps after the step using the "Should Execute Expression" are still applied. It only appears as if it doesn't run because it's being handed an image that is already black and white. The next step technically still runs. Should Execute Expressions only determine if a step is applied or not. They do not have anything to do with the order in which other steps are applied (The "Next Step Expression" can determine order).

Did the step execute on the brown transcript, which was not classified as the "Blue Transcript" image? It did not! The "Threshold - Adaptive" step was skipped and the next step in the sequence, "Threshold - Auto" ran as normal.
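
The behavior described here can be modeled with a small Python sketch. It is a conceptual stand-in, not the .NET expression engine: each step carries an optional predicate, a step without a predicate always runs, and a skipped step never stops the steps after it. The `run_profile` and `is_blue` names are hypothetical.

```python
def run_profile(steps, results):
    """Run steps in order, skipping any step whose should_execute predicate returns False."""
    executed = []
    for name, should_execute in steps:
        if should_execute is None or should_execute(results):
            executed.append(name)
    return executed


def is_blue(results):
    # Plays the role of: Results.Classify_Image.ClassName = "Blue Transcript"
    return results["Classify Image"] == "Blue Transcript"


steps = [
    ("Threshold - Adaptive", is_blue),
    ("Threshold - Auto", None),
]

blue_run = run_profile(steps, {"Classify Image": "Blue Transcript"})
brown_run = run_profile(steps, {"Classify Image": None})

print(blue_run)   # ['Threshold - Adaptive', 'Threshold - Auto']
print(brown_run)  # ['Threshold - Auto']
```

Note that "Threshold - Auto" appears in both runs. For the blue transcript it still executes, but it receives an image that is already black and white.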


Example: IP Groups and the Next Step Expression

Imagine for the example above, we wanted one set of IP Commands to run on the brown transcripts and another to run on the blue transcripts. Since we already know we can classify these documents separately and use that information to determine if a step should execute, we can use IP Groups to tell an IP Profile to execute a collection of IP Steps conditionally, based on the image classification, and determine what happens next using the "Next Step Expression" property.

We will keep this example fairly simple. On the left we have a blue transcript and the right a brown one.

For the blue transcript we want to execute two IP Commands: Threshold using Adaptive method and a basic "Line Removal"

For the brown transcript we want to execute two IP Commands: Threshold using Auto method and a basic "Speck Removal"

We don't want the brown transcript to run "Line Removal" and we don't want the blue transcript to run "Speck Removal". We will use the same "Classify Image" command we used in the previous example to classify the blue transcript as "Blue Transcript".

Add an IP Group

IP Groups are collections of IP Steps in an IP Profile. This allows you to nest a series of steps within an IP Profile. You can think of IP Groups as mini-profiles or sub-profiles that can be used as a single step in the execution sequence of an IP Profile.

  1. To add an IP Group, right click an IP Profile in the Node Tree.
  2. Mouse over "Add".
  3. Click on "IP Group".

  1. Name the IP Group whatever you'd like. We will name this one "Blue Transcript IP"
  2. Click "EXECUTE".

Add the steps you wish the group to execute as if you were adding them to an IP Profile. For this group, we will add a Threshold step using the Adaptive method and a basic Line Removal command. The IP Group will be nested as a child of the IP Profile, with its own steps nested as children of the IP Group.

  1. Navigate back up the Node Tree to the parent IP Profile. You will see the IP Group as a step in the profile.
  2. Without any conditional logic applied, the IP Steps in the IP Group simply run as if they were steps in the IP Profile.

Add the Remaining IP Steps

  1. Next, we will add the steps we want the brown transcript to execute: A Threshold step using the Auto method, and a basic Speck Removal command.

Set the Should Execute Expression

  1. This will be similar to the previous example. We will use the classification results of "Classify Image" to only execute the IP Group if the image was classified as "Blue Transcript".
  2. Navigate to the IP Group (here named "Blue Transcript IP"). Click the ellipsis button to the right of the Should Execute Expression.

  1. We can use the same code expression we used before: Results.Classify_Image.ClassName = "Blue Transcript"
  2. Click "OK" when finished.

This will only execute the steps in this IP Group if the Should Execute Expression evaluates to "True", which in this case means the image is classified as "Blue Transcript". However, the remainder of the IP Profile still runs for those images. As seen below, Speck Removal still executes. We want to tell the IP Profile to stop running at the end of the IP Group.

We can change the order of operations for this profile using the Next Step Expression property.

  1. Click the ellipsis button to the right of the Next Step Expression property.

The "Next Step Expression" dictates what happens next after the IP Step or IP Group executes. What we want to do is tell the IP Profile to stop running for the blue transcripts after the IP Group finishes, but continue onto the "Threshold - Auto" step for the brown transcript.

The code expression to execute this logic is as follows: If(Results.Classify_Image.ClassName = "Blue Transcript", Nothing, Steps.Threshold_Auto)

This follows the pattern If(condition, result if the condition is met, result otherwise).

The condition here is that the image is classified as a "Blue Transcript" by the Classify Image command. "Nothing" here means the IP Profile will stop processing and perform no more of the IP Steps in the profile. "Steps.Threshold_Auto" means the next step to execute will be the "Threshold - Auto" step in the profile.

  1. Enter the expression into the text area of the window.
  2. Click "OK" when finished.

  1. Now, the blue transcript quits running after the IP Group is finished. So, the rest of the IP Profile (the "Threshold - Auto" and "Speck Removal" commands) is skipped entirely.
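
The combined effect of the two expressions can be modeled in Python. This is a conceptual sketch rather than Grooper's execution engine. It assumes the Next Step Expression is evaluated whether or not the step itself executed, which the walkthrough implies but does not state, and `run_profile`, `is_blue` and `after_group` are hypothetical names.

```python
def run_profile(steps, class_name):
    """Run a profile. steps maps name -> (should_execute, next_step); either may be None."""
    names = list(steps)
    executed, current = [], names[0]
    while current is not None:
        should_execute, next_step = steps[current]
        if should_execute is None or should_execute(class_name):
            executed.append(current)
        if next_step is not None:
            current = next_step(class_name)     # a name to jump to, or None to stop
        else:
            position = names.index(current) + 1  # default: fall through to the next step
            current = names[position] if position < len(names) else None
    return executed


def is_blue(class_name):
    return class_name == "Blue Transcript"


def after_group(class_name):
    # Plays the role of:
    # If(Results.Classify_Image.ClassName = "Blue Transcript", Nothing, Steps.Threshold_Auto)
    return None if is_blue(class_name) else "Threshold - Auto"


steps = {
    "Classify Image": (None, None),
    "Blue Transcript IP": (is_blue, after_group),
    "Threshold - Auto": (None, None),
    "Speck Removal": (None, None),
}

blue_run = run_profile(steps, "Blue Transcript")
brown_run = run_profile(steps, None)

print(blue_run)   # ['Classify Image', 'Blue Transcript IP']
print(brown_run)  # ['Classify Image', 'Threshold - Auto', 'Speck Removal']
```

Returning `None` plays the part of `Nothing` in the expression: the profile stops, so the blue transcript never reaches "Threshold - Auto" or "Speck Removal".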
