IP Profile (Object)

From Grooper Wiki

An IP Profile is a sequence of instructions for image processing. It is composed of IP Steps and IP Groups. Each step or group of steps contains IP Commands, which define image processing operations.

These operations generally fall into three categories:

  • Adjustments to the archival exported image
  • Image cleanup to improve OCR results
  • Image-based data collection, including layout data (such as table line locations, barcode information, OMR checkbox states, and more) as well as image features used for Visual classification

Permanent vs. Temporary Image Processing

The Image Processing activity permanently alters a document's image by applying an IP Profile. However, it is also possible to clean up a document image temporarily and then revert to the original image. This is done during the Recognize activity.

For example, you may have a document where table lines are getting in the way of accurate OCR. However, if you remove these lines during the Image Processing activity, they will be permanently removed, making it difficult to review the documents in Data Review and changing the archival image stored later to something that no longer looks like the original document.

Instead, you can use an OCR Profile referencing an IP Profile containing a Remove Lines command during Recognize. The image will be temporarily changed according to the IP Profile. Then, OCR will run on the altered image. Last, the image will revert back to its original form.

  • Furthermore, any image-based data targeted by the IP Profile (such as the table line locations in this example) will still be saved to the Batch Page for later use.
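The temporary workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Grooper's implementation: `remove_lines`, `fake_ocr`, and the `page` dictionary are hypothetical stand-ins, and the image is a tiny grid where 1 means a black pixel.

```python
import copy

def remove_lines(image):
    """Hypothetical cleanup: blank out any row that is entirely black (1 = black)."""
    line_rows = [y for y, row in enumerate(image) if all(px == 1 for px in row)]
    cleaned = [[0] * len(row) if y in line_rows else list(row)
               for y, row in enumerate(image)]
    return cleaned, line_rows

def fake_ocr(image):
    """Hypothetical OCR stand-in: counts the black pixels it would have to read."""
    return sum(sum(row) for row in image)

def recognize_with_temporary_ip(page):
    """Clean a working copy, run OCR on it, and leave the stored image untouched."""
    working = copy.deepcopy(page["image"])
    cleaned, line_rows = remove_lines(working)
    page["ocr_result"] = fake_ocr(cleaned)
    page["layout_data"] = {"line_rows": line_rows}  # kept on the page for later use
    return page

page = {"image": [[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]]}
original = copy.deepcopy(page["image"])
recognize_with_temporary_ip(page)
```

After the call, the stored image is unchanged, the line position is still recorded, and the OCR stand-in only saw the cleaned copy.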

For more information on both permanent and temporary image processing as a concept, visit the Image Processing (Concept) article.

Anatomy of the IP Profile Tab

Upon selecting an IP Profile in the Node Tree, you will edit it using the IP Profile tab. This is what the screen will look like. As you can see, there are several windows that make up this screen.


Here, IP Steps in the profile are listed, selected, and added. An IP Profile is a sequential list of IP Steps, each one performing an image processing operation called an IP Command. This IP Profile is very simple. It has only one IP Step, using the Auto Deskew IP Command.

IP Steps are added to the list using the "Add" button, deleted using the "Delete" button, and you can change the order in which they process using the "Move Up" and "Move Down" buttons.

Here, you can select a Test Batch to help you configure your IP Profile. All alterations to the documents in the Test Batch are done in memory when configuring an IP Profile. They will retain their original form unless the IP Profile is applied using the Image Processing activity.


Here, you will see a list of processing results for each step in the IP Profile. Each step is listed with the time it took to run, whether or not the image was modified, and whether the image was flagged by the step. The list also contains an "End Result" row showing the total run time for the whole IP Profile, whether the profile modified the image, and whether the image is flagged by the end of it.
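As a rough mental model of that results list, the sketch below runs a list of steps in order and builds one row per step plus an "End Result" row. The `StepResult` type, `run_profile`, and the two toy steps are hypothetical names for illustration. How the "End Result" flag is derived is an assumption here (flagged if any step flagged the image).

```python
import time
from dataclasses import dataclass

@dataclass
class StepResult:
    name: str
    elapsed_ms: float
    image_modified: bool
    flagged: bool

def run_profile(steps, image):
    """Run each (name, function) step in order and record one result row per step."""
    results = []
    for name, func in steps:
        start = time.perf_counter()
        new_image, flagged = func(image)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results.append(StepResult(name, elapsed_ms, new_image != image, flagged))
        image = new_image
    # Assumption: the summary row totals the time and ORs the two boolean columns.
    end_result = StepResult(
        "End Result",
        sum(r.elapsed_ms for r in results),
        any(r.image_modified for r in results),
        any(r.flagged for r in results),
    )
    return image, results, end_result

# Two toy steps operating on a flat list of grayscale pixel values.
def invert(image):
    return [255 - p for p in image], False

def no_op(image):
    return list(image), False

final_image, rows, summary = run_profile(
    [("Invert", invert), ("No-Op", no_op)], [0, 128, 255]
)
```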


Each IP Command has its own set of configurable properties. Here, you can adjust them as needed to fit the demands of your document set.



Furthermore, using the "Selected Step" tab, you can create conditional logic around if and when to apply certain IP Steps, using a snippet of .NET code. This is done using the "Should Execute Expression" and "Next Step Expression" properties.



Lastly, you can use the "IP Profile" tab to add a description for the profile for other users to get more information as to what the profile does and is used for.


The Diagnostics Panel is extremely helpful when configuring IP Command properties and verifying that steps are processing a document as intended. For each IP Step, it contains a number of images showing how the step's IP Command is altering the image, including a before "Input Image" and an after "Output Image".


Last but not least is the Document Viewer. This allows you to view the document selected in the Batch Selector. This window will also show you the selected image in the Diagnostics Panel.



How To

Create a new IP Profile

Before you create an IP Profile, you will likely want a Test Batch to verify its results. Be sure to create a Test Batch before creating an IP Profile.

Add a New IP Profile

IP Profiles may be created and stored in a Content Model's local resources folder or in the "IP Profiles" folder in the "Global Resources" folder. However, by far the most common place to create an IP Profile is in the "IP Profiles" folder.

1) Navigate to the "IP Profiles" folder via this path in the Node Tree: Root Node > Global Resources > IP Profiles

2) Right click the "IP Profiles" folder, mouse over "Add", and select "IP Profile..."



Name the IP Profile whatever you like and select "OK" to create it.



This will create a blank IP Profile in the IP Profiles folder.



Before going any further, the first thing you will want to do is select a Test Batch. Using the Batch selector, select a Test Batch from the dropdown window. This will give you something to work with when testing out your IP Profile.



Add IP Steps

To add a new step, press the "Add" button and choose the IP Command you wish to use. For this example, we are adding an "Auto Deskew" step.


Verify the Results

To verify the results of the IP Profile, press the "Execute" button.


Each step will appear in the Diagnostics Panel with useful images for configuring the step's properties. There will also be a before "Input Image" and an after "Output Image", both for each individual step and for the entire IP Profile.



! Pressing the "Execute" button will not actually modify the selected document. All alterations to the documents in the Test Batch are done in memory when configuring an IP Profile. Furthermore, any time you navigate to another document, the IP Profile will "execute" on that document in the Test Batch. In other words, there's no need to press the Execute button unless you want to verify the results of the IP Profile on the currently selected document.
FYI You can apply the IP Profile from the IP Profile tab to the selected document using the "Save Processed Page" button, if you so choose.

Sample Configuration of an IP Profile for Permanent Image Processing

Permanent IP makes alterations to the archival version of the exported document. As permanent image processing is, after all, permanent, you must be careful about which commands you use. Most often, IP Profiles for Permanent IP are fairly small, performing only moderate adjustments to the image. More drastic alterations that improve OCR but dramatically alter the image are left for Temporary IP Profiles.

Before anything, you need a sample document set of the kinds of documents targeted by your IP Profile. Once you have these documents ready in a Test Batch, it's a good idea to evaluate them, getting an idea about the kinds of issues the document set has. Below are the documents we will be looking at and some of the issues involved with them.

One of the most common issues resolved by a permanently applied IP Profile is a skewed image. Not only will you improve the image visually for human readers, but deskewing an image is also critically important for OCR to run correctly.

This document does not look too bad, but the background isn't totally white. Some of the paper's texture came through when the document was scanned. A lot of scanners have built-in image processing functions that will clean this up, but this functionality is somewhat basic and can cause problems down the road. Grooper's IP capabilities are much more robust, and there may be cases where not cleaning up the image actually produces a better result (more on that later).


This image is upside down. As a rule of thumb, if a document is hard for a human to read, it'll be hard for an OCR Reader to recognize the document's characters.

There's not too much to fix on this document, but keep in mind an IP Profile will be applied to all documents. Part of creating the IP Profile is making sure that what positively impacts one document doesn't negatively impact another.


Border cropping is very common when permanently applying an IP Profile. Black borders around images are typically a result of scanning the image and not actually part of the document. Border Crop or Border Fill can get rid of that border, giving an image that is more representative of the actual document. All those black pixels around the edge of the document are also most definitely getting in the way of OCR. However, you might have to ask yourself: Is the border part of the original document? If so, should we keep the border for the archival export of the document? Take a look at the transcript on the left. At first glance, this border looks like it came from scanning the image, but it may actually be part of the document.

The first step we will add is an "Auto Border Crop" command. This command will only crop an image if it detects a border in an established zone. So, only those documents with borders should be affected.

Press the "Add" button and select "Auto Border Crop" from the "Border Cleanup" category.



For our document, the basic settings did not alter the image. You can see this viewing the "Processing Results". The "Image Modified" column has a value of "False". Also there is no "Output Image" in the Diagnostics Panel. Furthermore, you can see clearly there is still a black border on the image in the Document Viewer.



This is where the step's property panel comes in very helpful. The border is quite a bit larger than normal. Changing the "Border Region Size" to 0.75 in on all four sides will properly crop this document. The diagnostic images are useful when configuring any IP Step's properties. In the case of Auto Border Crop, you'll want to use the "Zoning" image to configure the zone where the border falls.

The Border Region zone is shown by the thin red rectangle. It encapsulates some, but not all of the border (from the edge of the document to the edge of the Border Region) using the default of 0.25 in.


By increasing the Border Region Size to 0.75 in you can see the zone now entirely encapsulates the border.
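To see why the larger setting matters, it can help to translate the Border Region Size into pixels. The sketch below is plain arithmetic, not Grooper code. The 300 DPI resolution and the 150-pixel border thickness are assumed values for illustration, and both function names are hypothetical.

```python
def region_size_px(size_inches, dpi):
    """Convert a border region size in inches to whole pixels at the given DPI."""
    return round(size_inches * dpi)

def region_covers_border(border_px, size_inches, dpi):
    """True if a border of the given thickness fits entirely inside the region."""
    return border_px <= region_size_px(size_inches, dpi)

DPI = 300          # assumed scan resolution
BORDER_PX = 150    # assumed thickness of the scanned black border

default_region = region_size_px(0.25, DPI)   # 75 pixels
enlarged_region = region_size_px(0.75, DPI)  # 225 pixels
```

Under these assumptions, a 150-pixel border extends past the default 0.25 in region but falls entirely within the 0.75 in region.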


Now, you can see the step removes (most of) the border.



The remaining part of the border can be removed using the "Border Fill Settings" or by adding a "Border Fill" step.

Next, we need to take care of the upside down image. This can be done with an "Auto Orient" step. This step will automatically detect the orientation of the image based on the text on the page. If the page is upright, Auto Orient will detect the text reads like normal and do nothing. If the command detects the text is upside down, it will re-orient the page, turning it right-side up.

Press the "Add" button and select "Auto Orient" in the "Image Transforms" category.



Upon execution, the upside down page has been righted.


Next we need to take care of the skewed text. For this, we will use the Auto Deskew command.

Press the "Add" button and choose "Auto Deskew" from the "Image Transforms" category.



This will resolve issues around skewed documents.


Further Decision Making

The previous three IP Commands are very commonly found in Permanent IP Profiles. Since they determine automatically if a page is skewed, has a border, or is oriented incorrectly, there is significantly less risk that these commands will negatively impact the rest of the document set (less risk, but not none!). For any other IP Commands, you have one major question to ask.

How much do you want to alter the original image?

Anything you do with a Permanent IP Profile will affect the final document you send out during export. Keep in mind, there are ways to temporarily apply an IP Profile to improve OCR Results later.

When making your choices on what other IP Commands to add to a Permanent IP Profile you should focus on altering the document for human readability without negatively impacting machine readability (i.e. OCR results).


For example, the "Brightness Contrast" Color Adjustment command can be very helpful to increase both the human and machine readability for some documents. It modifies the brightness or contrast of an image (or both).

Increasing the Brightness property to "10" and the Contrast property to "40" cleans up the first page of this document nicely. The background has been removed while preserving the text.
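The general effect can be sketched with a generic linear brightness and contrast model. This is an assumption for illustration only: both values are treated as percentages, and Grooper's own scale and formula may differ. `adjust_brightness_contrast` is a hypothetical helper working on one grayscale pixel value (0 is black, 255 is white).

```python
def adjust_brightness_contrast(pixel, brightness, contrast):
    """Scale the pixel's distance from mid-gray, shift it, then clamp to 0-255."""
    scaled = (pixel - 128) * (1 + contrast / 100) + 128
    shifted = scaled + brightness * 255 / 100
    return max(0, min(255, round(shifted)))

background = adjust_brightness_contrast(220, brightness=10, contrast=40)
text = adjust_brightness_contrast(40, brightness=10, contrast=40)
```

In this model a light gray background pixel is pushed to pure white while a dark text pixel stays dark, which matches the behavior described above. The same push toward white is what can erase faint content on other pages.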


However, be sure to verify it does not negatively impact other pages.

On the second page, this command also starts to remove the table lines on the page. Grooper can use the positions of these lines for certain data extraction methods, such as Table Extraction. However, if you get rid of them with a Permanent IP Profile, you won't be able to find those lines later.


Furthermore, the IP Profile runs on all documents in the batch. Be sure it isn't adversely affecting other documents.

This page is fairly light to begin with. While the typewritten text is still there, the handwritten text is starting to fade. OCR Engines may not do well recognizing handwritten text, but if you ever want a human to look at this document and be able to read the handwritten text, increasing the contrast is not well suited for this document.

For situations like this, you may need to find a happy middle ground.

The "Contrast Stretch" command is often used to help improve the image quality of documents. It works to normalize the image's contrast. It adjusts the contrast so that the lightest pixels are turned pure white and the darkest are turned pure black.

It doesn't do quite as good a job as the "Brightness Contrast" command, but it does brighten the whites and darken the black parts of the image a bit, and it does so without losing the handwritten text on this document.
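The normalization idea behind a contrast stretch can be sketched as a simple min-max rescale. This is a minimal sketch of the general technique, not Grooper's implementation. Production tools often clip a small percentage of outlier pixels first, and that step is omitted here.

```python
def contrast_stretch(pixels):
    """Linearly rescale grayscale values so the darkest becomes 0 and the lightest 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # flat image: nothing to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]

stretched = contrast_stretch([50, 100, 200])
```

Because every pixel is rescaled relative to its neighbors on the same scale, faint content such as light handwriting keeps its position between black and white instead of being pushed to pure white.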

Just as you need to think about how one command will adversely affect other documents in the set, you can take advantage of documents that are very different from others in the set.

This document is the only color document and has some significant issues with how it was scanned. Most likely it's actually a picture taken of a screen. Portions of the image are light and portions are dark. Its white balance is off. Since all the other documents are black and white or grayscale, we can reliably use an "Auto White Balance" command to make it look a little better.


Furthermore, there are some ways to leverage image based information to create conditional logic around what steps to execute in an IP Profile.

Make Adjustments

It's very rare to make an IP Profile and have everything work perfectly without doing some testing and adjusting some properties on a step or two. Take our "Auto Border Crop" step and our two documents with borders. As we configured the step, these are the results for our two transcripts.



Depending on what route we want to take, there's still some cleaning up on these borders we can do. The easiest thing to do at this point would be to add a "Border Fill" command to clean up the black edges on the left side of these documents.

Press the "Add" button and select "Border Fill" under the "Border Cleanup" category.



One of the main adjustments you will make when using a Border Fill command is setting the "Method" property. By default, it is set to "Exclusive". This means anything fully outside the border zone (seen in the image below in red) will be filled with the selected "Fill" color.



Note the red zone intersects the black border for this document. If we set the method to "Inclusive" it will include borders that overlap the border zone, dropping them out.
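One way to picture the two methods is with rectangles. The sketch below is one plausible reading of the behavior described above, not Grooper's actual algorithm: the red zone line is modeled as the edge of an interior rectangle, "Exclusive" fills only dark regions that lie completely outside it, and "Inclusive" also fills regions that cross it. All names and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if rectangles (left, top, right, bottom) share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def contains(outer, inner):
    """True if rectangle `inner` lies entirely within rectangle `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def should_fill(region, interior, method):
    """Decide whether a dark region is treated as border and filled."""
    if method == "Exclusive":
        return not overlaps(region, interior)   # must sit fully in the margin
    if method == "Inclusive":
        return not contains(interior, region)   # may cross the zone line
    raise ValueError(f"unknown method: {method}")

# A 100 x 100 page with a 10-unit border region on every side.
interior = (10, 10, 90, 90)
thin_border = (0, 0, 5, 100)       # entirely inside the margin
thick_border = (0, 0, 15, 100)     # crosses the zone line
page_text = (40, 40, 60, 50)       # ordinary content in the middle of the page
```

Under this reading, the thin border is filled by both methods, the thick border only by "Inclusive", and the page text by neither.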



But what if we start looking at these documents and don't actually want to remove the border for the transcript on the left but do want to remove the border for the right?


Keep this border Remove this border


For this document set, there's a trick we can do using the properties of "Auto Border Crop" and "Border Fill" to do this. First we will configure "Auto Border Crop" to crop the blue transcript and not the brown one. Looking at the brown transcript, that border is actually part of the document. As such, there is a sliver of white pixels around the document. We can use the "Maximum Border Weight" property to only drop out perfectly solid borders.

Navigate to the "Auto Border Crop" property and change the "Maximum Border Weight" from 90% to 100%.



However, since the blue transcript does have a solid black border around the image, it does crop the image. Or at least it mostly does. There is a slight border on the left and right sides still, but we'll fix that next.



Now, if we go back down to our "Border Fill" command and adjust the "Border Region Size" to "15pt" these left and right borders will be totally outside the border zone. We can set the "Method" back to "Exclusive" now and these borders will be dropped out.



However, for the brown transcript, the border intersects the border zone. So, using the "Exclusive" method keeps the border from being dropped out.



We are left with two originally bordered images, one whose border was removed and one whose border was kept.


Granted, this really only worked because of how these documents came into Grooper. We got lucky in that there was a slight amount of white pixels on each edge of the brown transcript, making the "border" not perfectly solid. That being said, a lot of how you configure Grooper's properties to target certain documents and not others is based on analyzing certain aspects of the documents. We wouldn't have even known to try this approach if we hadn't noticed that the border on the brown transcript wasn't a true border.

Sample Configuration of an IP Profile for Temporary Image Processing

For temporary image processing, we don't need to be concerned with how this image will look upon export. We only need to concern ourselves with cleaning up the image to improve OCR results. The general plan is (as much as possible) get rid of anything on the page that is not text. This way non-text artifacts on the page will not interfere with the OCR Engine recognizing actual text characters.

Before anything, you need a sample document set of the kinds of documents targeted by your IP Profile. Once you have these documents ready in a Test Batch, it's a good idea to evaluate them, getting an idea about the kinds of issues the document set has. Below are the documents we will be looking at and some of the issues involved with them.

This document is full of interference for OCR. Table lines, check boxes, and partially shaded headers can all cause problems for accurate OCR results. Furthermore, we will likely want to use the line and box positions later during data extraction. A Temporary IP Profile can be configured to remove these elements but store their locations for later use.

This document too is filled with table lines, and it also has "negative regions", portions of white text on a black background. OCR must be able to read black pixels. So, we will switch the white text to black during image processing. The image will also need to be turned into true black and white instead of grayscale.


This document may seem simple, but again, each command runs on each document in the document set. We will need to make sure the IP Profile works for each document in the set.

OCR absolutely must work with a black and white image. While OCR Engines will turn images black and white on their own, they don't always do a great job at it. The vast majority of Temporary IP Profiles will contain a "Threshold" or "Binarize" step to convert color and grayscale images into true black and white.
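To illustrate what thresholding does, the sketch below binarizes a single row of grayscale pixels in two ways: with one fixed cutoff for the whole row, and with a cutoff taken from each pixel's local neighborhood. These are generic textbook techniques chosen for illustration. They are not the algorithms behind Grooper's "Auto" or "Adaptive" methods, and both function names are hypothetical.

```python
def global_threshold(row, cutoff=128):
    """Binarize with a single cutoff: below it becomes black (0), otherwise white (255)."""
    return [0 if p < cutoff else 255 for p in row]

def local_mean_threshold(row, radius=1):
    """Binarize each pixel against the mean of its immediate neighborhood."""
    out = []
    for i, p in enumerate(row):
        window = row[max(0, i - radius): i + radius + 1]
        mean = sum(window) / len(window)
        out.append(0 if p < mean else 255)
    return out

# Left half: bright background (200) with dark text (60).
# Right half: shaded background (120) with dark text (20).
row = [200, 60, 200, 120, 20, 120]
fixed = global_threshold(row)
local = local_mean_threshold(row)
```

With the fixed cutoff, the shaded background on the right turns black along with the text. The local version keeps the text black and the background white in both halves, which is why unevenly lit or shaded documents often need a different thresholding method than clean ones.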

Conditional IP

What happens when you have documents in your document set that just don't fit your IP Profile? Perhaps one configuration of a Box Removal command works for most of the documents, but there's one type of document that needs an entirely different configuration? What happens if most of the documents in your set perform well using the standard "Auto" thresholding method, but one works better using "Adaptive"?

In these situations, you may be able to use conditional logic via the "Should Execute" and "Next Step" expressions on IP Steps and IP Groups in an IP Profile. These expressions allow us to use snippets of .NET code to access information about the image or steps in the profile, and use them to determine if and when a step should run in an IP Profile.

Example: Should Execute Based on the Success of the Previous Step

One thing you might want to do in an IP Profile is create a logical order of operations where if the first step fails, a second "fail safe" step should run before moving on to the next step. The logic might look something like this:

IF "Step 1" succeeds THEN go to "Step 3"

IF "Step 1" fails THEN go to "Step 2" (Then "Step 3" will run like normal after "Step 2" is finished)
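The same branching can be written out as a short Python sketch. The step functions here are toy string operations with hypothetical names, standing in for real IP Commands. The only point is the control flow: "Step 2" runs only when "Step 1" reports failure, and "Step 3" always runs.

```python
def run_with_fail_safe(image, step1, step2, step3):
    """Run Step 1; run Step 2 only if Step 1 failed; always finish with Step 3."""
    log = []
    image, succeeded = step1(image)
    log.append("Step 1")
    if not succeeded:
        image = step2(image)
        log.append("Step 2")
    image = step3(image)
    log.append("Step 3")
    return image, log

# Toy steps: the "image" is a string and the "features" are marker characters.
def remove_hashes(image):
    return image.replace("#", ""), "#" in image   # succeeds only if it found a "#"

def remove_stars(image):
    return image.replace("*", "")

def trim(image):
    return image.strip()

first_doc, first_log = run_with_fail_safe("# text ", remove_hashes, remove_stars, trim)
second_doc, second_log = run_with_fail_safe("* text ", remove_hashes, remove_stars, trim)
```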

Consider the following example for "Box Removal"

The following example uses these documents. We want to make a simple IP Profile using "Box Removal" and "Line Removal" commands. However, as we will see, we will need conditional logic, using a "Should Execute" expression, to avoid certain issues.


Configure the First Box Removal

First, let's add a "Box Removal" and "Line Removal" command. Both are found in the "Feature Removal" category.



Selecting the first document, "Application for Cow Ownership", there are a few configurations we need to take into consideration. First, there are some very small boxes on this document. So, we need to adjust the "Size Range" property. Change the "Size Range" from "7pt - 16pt" to "6pt - 16pt" (Adjust the "Minimum" sub property to "6pt").

This will get all the tiny checkboxes on the document. However, we've got an unintended consequence. It's also removing the two "o"s in "Houston".



Not a huge deal. They are being removed because of some of our box detection properties, specifically the "Minimum Aspect Ratio". The aspect ratio properties define the "squareness" of a box. At 100%, only perfectly square boxes will be removed. The default of 75% for the Minimum Aspect Ratio allows for variation in how "skinny" the rectangle is. It will allow boxes to be detected if they are 75% as wide as they are high. However, the boxes on this document are very square. If we increase this property to 90%, we are in the clear. The "o"s are no longer detected as boxes and are therefore not removed.
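The effect of these two properties can be sketched as a simple filter. This is an illustration under stated assumptions, not Grooper's detection logic: the aspect ratio is taken as the shorter side divided by the longer side, the size range is applied to both width and height, and the shape dimensions (in points) are made-up values.

```python
def aspect_ratio(width, height):
    """Squareness of a rectangle: 1.0 is a perfect square, smaller is skinnier."""
    return min(width, height) / max(width, height)

def is_box_candidate(width, height, min_aspect_ratio, min_size=6.0, max_size=16.0):
    """True if the shape is within the size range and square enough."""
    sized = min_size <= width <= max_size and min_size <= height <= max_size
    return sized and aspect_ratio(width, height) >= min_aspect_ratio

checkbox = (6.5, 6.5)      # assumed size of a small, very square checkbox
letter_o = (6.5, 8.0)      # assumed size of a letter "o"
skinny_box = (6.5, 9.0)    # assumed size of a taller, skinnier box
```

With these assumed sizes, a 75% minimum lets the "o" through, a 90% minimum rejects it while keeping the square checkbox, and only a lower minimum such as 70% accepts the skinny box.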


Configure the Second Box Removal

However, for the second document, the "Box Removal" command won't work. These boxes are indeed skinnier than normal squares. So our configurations will not work.

Go ahead and add a second "Box Removal" command and move it between the first and last steps.



To target these boxes we will need to decrease the Minimum Aspect Ratio. Adjusting this to 70% will detect and remove these boxes. Also, these boxes are on the smaller side. So we will need to adjust the Minimum Size to 6pt as well.



However, now this second "Box Removal" command is removing the "o"s in "Houston" on the first document.



No need to worry. We can use a "Should Execute" expression to run the first Box Removal on the first document, but the second Box Removal on the second.

Edit the Should Execute Expression

The conditional expressions are on the "Selected Step" tab of a selected step. Select the second "Box Removal" command, and switch to the "Selected Step" tab.



Select the "Should Execute Expression" property and press the ellipsis button at the end.



This will bring up the "Should Execute Expression" editor. Here you can write a snippet of code that determines whether or not the selected step executes. If the expression evaluates to "true", the step will run. If it evaluates to "false", it will not.



From here, we need to make an expression based on the results of the previous Box Removal command. If the previous Box Removal did not find any boxes, we want this Box Removal to execute. Otherwise, we want to use the results of the first Box Removal. This will prevent the second Box Removal command from running on the first document and removing the "o"s in "Houston".

We can use the following expression: Results.Box_Removal.Boxes.Count = 0

This will return "True" if the Box Removal command before it found 0 boxes. If that Box Removal command found at least one box, this second Box Removal command will not run.



Now, only the first Box Removal command runs on the first document, skipping over the second step entirely.



But, for the second document, since no boxes were detected using the first Box Removal command, the second step runs!




Example: Should Execute Based on Classify Image

Different IP Steps, IP Groups, or even entire IP Profiles can be executed based on the results of the "Classify Image" command. The "Classify Image" command compares an image against a set of sample images and classifies the image based on which sample it is most similar to. It does this by analyzing the color space of an image. For example, the RGB color space is made up of a red channel, a green channel, and a blue channel. The similarity would then be based on how similar the information in these three channels is to another image. For example, if a sample image has a high value in the red channel but a low value in the blue channel, it would not match an image that has a high blue channel but a low red channel.

For this example, we will create a conditional expression for thresholding these two documents.



The blue transcript is a good candidate for using "Adaptive" thresholding over the "Auto" method.

Auto Adaptive


However, the brown transcript is handled better by the "Auto" thresholding method.


Auto Adaptive


We will use the "Classify Image" command to have one image use "Auto" and the other "Adaptive".

Add IP Steps

This IP Profile will have three IP Steps: One "Classify Image" and two "Threshold" commands, one of which uses the "Auto" method and the other which uses the "Adaptive" method. Here, the two "Threshold" commands have been renamed accordingly.


Give Classify Image Sample Images

Select the "Classify Image" step and navigate to the "Sample Images" property. Press the ellipsis button at the end of the property.



This will bring up the "Sample Images" window. Press the "Add" button to add a new sample image.



From here, select a sample from your test batch. For this example, we are selecting this blue transcript. Give this sample a name using the "Sample Name" box. Press "OK" when finished.



This will add the image to the list of Sample Images on the left. When finished adding sample images, press "Done".


Select the Color Space to Analyze

Next, select the color space you wish to use to classify the image. There are a variety of color space options, each of which measures different channels making up a document's color. For this example we are using the HSV color space, which measures hue, saturation, and value (pixel intensity).



Classify Image's Execution Log shows the measurements for the selected color space and how similar they are to the sample image. All color spaces will have "Channel 1", "Channel 2", "Channel 3", and "Entropy" listed under "Source Image Features". These are the image-based measurements of the selected image. The three channels correspond to the information in the three channels of the selected color space. For HSV, "Channel 1" is the hue, "Channel 2" is the saturation, and "Channel 3" is the value. "Entropy" is a measure of how "busy" the image is. The more black text on a document, the higher the entropy measure will be.

The similarity score to each trained image is seen under "Results for Trained Image Classes". We only trained one image, the "Blue Transcript", which is coming in as "99.77%" similar.

The image is assigned a classification based on how similar the three channels and the entropy are to the trained images (assuming it meets the "Minimum Similarity" score). This image is classified as "Blue Transcript". This can be verified by its "Class Name".



For the brown transcript, it was different enough from the trained image that it did not meet the minimum similarity of 85%. Hence it received no classification. This can be verified by its "Class Name", which is "None".
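The classification idea can be sketched with Python's standard `colorsys` module. Everything here other than `colorsys.rgb_to_hsv` is a hypothetical illustration: the feature set is just the average hue, saturation, and value (entropy is omitted), the similarity score is an invented formula, and the pixel colors are made up. Grooper's actual measurements and scoring are not documented in this article and will differ.

```python
import colorsys

def mean_hsv(pixels):
    """Average hue, saturation, and value over a list of (r, g, b) pixels in 0-255.

    Note: a plain average of hue ignores that hue is circular; fine for a sketch.
    """
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    n = len(hsv)
    return tuple(sum(c[i] for c in hsv) / n for i in range(3))

def similarity(features_a, features_b):
    """Hypothetical score: 1 minus the mean absolute difference across channels."""
    diffs = [abs(a - b) for a, b in zip(features_a, features_b)]
    return 1.0 - sum(diffs) / len(diffs)

def classify(pixels, samples, minimum_similarity=0.85):
    """Return the best-matching sample name, or "None" if nothing is similar enough."""
    features = mean_hsv(pixels)
    best_name, best_score = "None", 0.0
    for name, sample_features in samples.items():
        score = similarity(features, sample_features)
        if score >= minimum_similarity and score > best_score:
            best_name, best_score = name, score
    return best_name

# One trained sample, built from an assumed pale blue page color.
samples = {"Blue Transcript": mean_hsv([(180, 200, 240)])}
blue_page = [(170, 195, 235)]    # assumed: another pale blue page
brown_page = [(200, 170, 130)]   # assumed: a tan or brown page

blue_class = classify(blue_page, samples)
brown_class = classify(brown_page, samples)
```

The blue page lands close to the trained sample and receives its class name, while the brown page falls below the minimum similarity and is left as "None", mirroring the two outcomes described above.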



Now that we have a benchmark that can tell one image from the other, we can use it to conditionally threshold the image.

Set the Should Execute Expression

Next we need to figure out and apply our logic for thresholding these documents. If the image is classified as "Blue Transcript" we want to use the Adaptive method. Otherwise, we want to use the Auto method. The next step should be the Threshold (or Binarize) command using Adaptive thresholding.

Select the "Threshold - Adaptive" step. Navigate to the "Selected Step" tab. Select the "Should Execute Expression" property and press the ellipsis button at the end.



The expression we will write will reference the results of the "Classify Image" command. The basic idea here is that if the image was classified as "Blue Transcript", the step should execute; otherwise, it should not. The code expression we can use is as follows: Results.Classify_Image.ClassName = "Blue Transcript"

Note, the class name you enter must match the sample image's name exactly. Press the "OK" button when finished.



Success! The blue transcript was turned black and white using the Adaptive method. However, the real test will be if the brown transcript skipped the "Threshold - Adaptive" step and went straight to the "Threshold - Auto" step.



! Although it may appear as if the "Threshold - Auto" step is skipped, all steps after the step using the "Should Execute Expression" are still applied. It only appears as if it doesn't run because it's being handed an image that is already black and white. The next step technically still runs. Should Execute Expressions only determine if a step is applied or not. They do not have anything to do with the order in which other steps are applied (The "Next Step Expression" can determine order).


Did the step execute on the brown transcript, which was not classified as the "Blue Transcript" image? It did not! The "Threshold - Adaptive" step was skipped and the next step in the sequence, "Threshold - Auto" ran as normal.
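The behavior described in this example, including the note that later steps are still applied, can be modeled in a few lines. This is a toy sketch with hypothetical names: the image is a string, the thresholding functions leave an already binarized image unchanged, and the predicate stands in for the Should Execute Expression.

```python
def threshold_adaptive(image):
    """Toy stand-in: binarize only if the image is not already black and white."""
    return image if image.startswith("binary") else "binary(adaptive)"

def threshold_auto(image):
    """Toy stand-in: binarize only if the image is not already black and white."""
    return image if image.startswith("binary") else "binary(auto)"

# Each step: (name, function, should-execute predicate or None for "always").
steps = [
    ("Threshold - Adaptive", threshold_adaptive,
     lambda results: results["Classify Image"]["ClassName"] == "Blue Transcript"),
    ("Threshold - Auto", threshold_auto, None),
]

def run(image, class_name):
    """Apply the steps in order, skipping any step whose predicate is false."""
    results = {"Classify Image": {"ClassName": class_name}}
    applied = []
    for name, func, should_execute in steps:
        if should_execute is None or should_execute(results):
            image = func(image)
            applied.append(name)
    return image, applied

blue_image, blue_steps = run("grayscale", "Blue Transcript")
brown_image, brown_steps = run("grayscale", "None")
```

For the blue transcript both steps are applied, but the second has nothing left to change. For the brown transcript the adaptive step is skipped and only the auto step alters the image.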


Example: IP Groups and the Next Step Expression