2.80:Extract Page (IP Command): Difference between revisions

From Grooper Wiki
{|style="margin:right"
{{AutoVersion}}
 
[[file:1573742604267-578.png|frame]]
 
<blockquote>{{#lst:Glossary|Extract Page}}</blockquote>
 
A carrier image, for our purposes here, is simply the original input image containing a document.  If you've ever deposited a check from a mobile phone, you've sent your bank a carrier image of the check.  The document, in this case a check, is "carried" to another application where the check is removed from the background and straightened out to extract the deposit amount and account number.
 
Extract Page works much the same way.  This IP Command extracts a page from a dark background, or from a light background where the page's border edges are visible.  It detects the edges of the page, which form a quadrilateral, and “cuts out” the page from the background.  At the same time, it repairs any skew, shear or perspective warping, producing a straight, flat image.
<br clear = all>
 
== About ==
Below is an example of how Extract Page works.  The document on the left is from the carrier image.  It is skewed and has a hefty black border around it.  The document on the right is the result of the Extract Page operation.  The document is straightened out and the border is removed.
 
{|style="text-align:center; margin:auto" cellspacing="10"
|-
|The original image<br />[[file:1573659273321-758.png|center|400px]]||The extracted image<br />[[file:1573832146767-464.png|center|400px]]
|}
 
== Version Differences ==
 
Extract Page is new to version 2.80.  In previous versions, it would be necessary to approximate this functionality through multiple IP Commands, including Auto Border Crop, Auto Deskew, and Warp.
 
== Use Cases ==
 
[[file:microfiche.jpg|thumb|Documents on a microfiche card]]
 
Extract Page was developed specifically for Grooper's microfiche processing capabilities.  Documents exist on a microfiche card in rows and columns.  Grooper's microfiche activities make individual document images out of the matrix of documents on the fiche card.  However, a slight black border persists to account for variations in how the documents were originally reproduced on the fiche card.  Furthermore, since a fiche card is created by taking a film image of documents, it is common for slight skewing and warping to occur, especially towards the edges of the film.  The Extract Page command resolves both of these issues.
 
That does not mean the command is limited to microfiche processing.  Extract Page can be used any time documents need to be removed from a dark background (or a light background where the edges of the document are discernible from the background) and warped or skewed images need adjusting.
 
<br clear = all>
 
== How To:  Add the command to an IP Profile ==
 
<tabs>
<tab name="Prereq">
==== Before you begin ====
 
This guide assumes you've created an IP Profile and have a Test Batch ready to configure the Extract Page command.
 
</tab>
<tab name="Step 1">
==== Add Extract Page to the IP Profile ====
 
1. Navigate to your IP Profile in the "IP Profiles" folder in the "Global Resources" folder in the Node Tree.
 
2. Press the "Add" button to add a new IP Command to your IP Profile.
 
 
[[file:1573832404164-783.png|center]]
 
 
3. Select the "Image Transforms" category. Then, select "Extract Page".
 
 
[[file:1573832950498-787.png|center]]
 
 
</tab>
<tab name="Step 2">
==== Turn the document black and white & define its borders ====
 
===== Binarization =====
 
 
[[file:1573833028046-933.png|center]]
 
 
1. In order for Extract Page to work, color and grayscale images must first be "binarized", or converted to black and white.  Binarization does this by "thresholding" the image.  A threshold value is set on pixel intensity (a pixel's "lightness" or "brightness").  Pixels with an intensity above the threshold are converted to white, and those below it are converted to black.  The threshold can be set manually or found automatically.  The Thresholding Method can be set to one of four options:
 
* Simple - Thresholds an image to black and white using a fixed threshold value between 0 and 255.
* Auto - Selects a threshold value automatically using Otsu's Method.
* Adaptive - Thresholds pixels based on the intensity of pixels in the local neighborhood.
* Dynamic - Performs adaptive thresholding, while preserving dark areas on the page.
 
Each method has its own configurable properties.  For more information on binarization and these methods, visit the [[Binarize]] article.
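
The thresholding idea described above can be sketched in a few lines of Python.  This is only an illustration, not Grooper's implementation: the function names <code>otsu_threshold</code> and <code>binarize</code> are hypothetical, and the image is assumed to be a flat list of 8-bit grayscale values (0 to 255).  The first function picks a threshold using Otsu's Method (as the "Auto" option does); the second applies a fixed threshold (as the "Simple" option does).

```python
def otsu_threshold(pixels):
    """Pick the threshold (0-255) that maximizes between-class variance.

    Hypothetical helper for illustration; `pixels` is a flat list of
    8-bit grayscale intensities.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels with intensity <= t
    sum_dark = 0    # sum of intensities of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Pixels above the threshold become white (255), the rest black (0)."""
    return [255 if p > threshold else 0 for p in pixels]


# Three dark pixels and three light pixels (made-up sample values).
sample = [10, 12, 11, 200, 210, 205]
t = otsu_threshold(sample)
black_and_white = binarize(sample, t)
```

With the made-up sample above, the chosen threshold falls between the dark and light groups, so the three dark pixels become black and the three light pixels become white.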
 
===== Border Size =====
 
 
[[file:1573836254056-603.png|center]]
 
 
2. You also must set where on the page edge detection will be performed.  This is done using the "Border Size" property.  The idea here is to define a rectangle that will fall inside the page you want to extract.  When configuring this property, use the "Zoning" diagnostic image.  The blue rectangle is the defined border region.
 
{|cellspacing="10" style="margin:auto"
|-
|style="width:50%"|The default Border Size of 0.25 inches from each edge is not going to work here.  The region only overlaps part of the document.||Here, the Border Size is set to 0.5 inches from each edge.  The region completely overlaps the document.  With further configuration, Extract Page will work successfully with this Border Size.
|-
|[[image:1573833789329-699.png|center|450px]]||[[image:1573833796480-805.png|center|450px]]
|}
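
As a rough sketch of the geometry, the Border Size can be thought of as an inset from each edge of the image, and the resulting rectangle should fall inside the page.  The Python below is illustrative only.  The function names, the 300 DPI resolution, the image size, and the page position are all assumptions made up for the example, not values from Grooper.

```python
def border_region(image_w, image_h, border_in, dpi):
    """Return (left, top, right, bottom) of the rectangle inset by
    `border_in` inches from each image edge.  Hypothetical helper."""
    inset = round(border_in * dpi)
    return (inset, inset, image_w - inset, image_h - inset)


def falls_inside(region, page):
    """True if `region` lies entirely within the `page` bounding box."""
    r_left, r_top, r_right, r_bottom = region
    p_left, p_top, p_right, p_bottom = page
    return (r_left >= p_left and r_top >= p_top
            and r_right <= p_right and r_bottom <= p_bottom)


# Assumed example: a 4" x 5" carrier image at 300 DPI, with the page
# sitting 0.4" (120 px) in from every edge.
IMAGE_W, IMAGE_H, DPI = 1200, 1500, 300
page_box = (120, 120, 1080, 1380)

default_region = border_region(IMAGE_W, IMAGE_H, 0.25, DPI)
larger_region = border_region(IMAGE_W, IMAGE_H, 0.5, DPI)

default_ok = falls_inside(default_region, page_box)
larger_ok = falls_inside(larger_region, page_box)
```

In this made-up case the 0.25 inch region extends past the page into the background, while the 0.5 inch region sits entirely inside the page.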
 
 
</tab>
<tab name="Step 3">
==== Set line detection settings ====
 
The image is extracted by first finding the document's edges.  Once Grooper knows where the lines around a document are inside the carrier image, it can digitally cut along those lines to remove the document.  There are two configurable properties for line detection, "Angle Precision" and "Threshold".
 
===== Angle Precision =====
 
 
[[file:1573836517291-285.png|center]]
 
 
"Angle Precision" sets the angle increment used when detecting each of the four lines that bound the extracted image.  "1/64 degrees" is the most precise.  "1 degree" is the least.  In the example below, Angle Precision was first set to "1 degree".  Only the right edge was detected because it was at least a full degree off vertical.  The other three edges are each less than a full degree off horizontal or vertical.  The operation could not reliably determine their angles, so they were not detected.  When set to "1/4 degree" precision, it found the remaining three sides, and the image could be extracted.
 
{|style="text-align:center; margin:auto" cellspacing="10"
|-
|1 degree angle precision.  Only one line found (seen in red).||Page cannot be extracted.
|-
|[[image:1573598823380-456.png|center|450px]]||[[image:1573598833416-651.png|center|450px]]
|-
|1/4 degree angle precision.  All four lines found (seen in red).||Page is extracted.
|-
|[[image:1573598828027-424.png|center|450px]]||[[image:1573598838574-603.png|center|450px]]
|}
|}
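
One way to picture why a coarse increment struggles with slightly tilted edges is to snap an edge's true angle to the nearest multiple of the increment and look at what is left over.  The Python below is a simplified illustration under that assumption.  It is not a description of how Grooper detects lines, and the function name <code>snap_angle</code> and the 0.3 degree example edge are hypothetical.

```python
from fractions import Fraction


def snap_angle(true_angle, increment):
    """Snap an angle (degrees) to the nearest multiple of `increment`.

    Hypothetical helper: assumes candidate angles are exact multiples
    of the increment.  Fractions keep the arithmetic exact.
    """
    steps = round(true_angle / increment)
    return steps * increment


# An edge tilted 0.3 degrees away from horizontal (made-up value).
edge = Fraction(3, 10)

coarse = snap_angle(edge, Fraction(1))      # "1 degree" precision
fine = snap_angle(edge, Fraction(1, 4))     # "1/4 degree" precision

coarse_error = abs(edge - coarse)
fine_error = abs(edge - fine)
```

At 1 degree precision the closest candidate to this edge is 0 degrees, leaving a 0.3 degree mismatch.  At 1/4 degree precision the closest candidate is 0.25 degrees, and the mismatch shrinks to 0.05 degrees.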


===== Threshold =====


[[file:1573836429089-612.png|center]]


"Threshold" sets the minimum length a detected line must have to count as an edge, expressed as a percentage of the image's width (for horizontal lines) or height (for vertical lines).  The default setting is 75%.  So, if the left (vertical) edge of the page you want to extract is at least 75% as long as the height of the whole image, that edge qualifies.  All four edges must qualify for the page to extract.

{|cellspacing="10" style="margin:auto"
|-valign="top"
|Again, the "Zoning" diagnostic image will help you configure this setting.  Detected lines are shown in red.  You want a red line along all four edges of the page to be extracted.  Here, the image is (roughly) 4" by 5" and the white rectangle is 3" by 4".  The top and bottom edges of the white rectangle are 75% of the image's width.  The left and right edges are 80% of its height.  So, a Threshold of 75% should work here.||[[image:1573835538057-538.png]]
|}
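
The arithmetic in the example above can be checked with a short Python sketch.  The function names are hypothetical and this is only a worked calculation of the percentages, not Grooper's line detection.  It assumes an edge qualifies when its length is at least the Threshold percentage of the matching image dimension.

```python
def edge_qualifies(edge_length, image_length, threshold_pct):
    """True if the edge is at least `threshold_pct` percent as long as
    the matching image dimension.  Multiplying out avoids division."""
    return edge_length * 100 >= threshold_pct * image_length


def page_extracts(page_w, page_h, image_w, image_h, threshold_pct):
    """All four edges must qualify: top/bottom are measured against the
    image width, left/right against the image height."""
    horizontal_ok = edge_qualifies(page_w, image_w, threshold_pct)
    vertical_ok = edge_qualifies(page_h, image_h, threshold_pct)
    return horizontal_ok and vertical_ok


# The example from the text: a 3" x 4" page in a 4" x 5" image.
# Top/bottom edges: 3 / 4 = 75%.  Left/right edges: 4 / 5 = 80%.
at_default = page_extracts(3, 4, 4, 5, 75)
at_eighty = page_extracts(3, 4, 4, 5, 80)
```

At the default of 75% both pairs of edges qualify.  Raising the Threshold to 80% would disqualify the top and bottom edges, since they only span 75% of the image width.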


{|cellspacing="10" style="margin:auto"
|-valign="top"
|As a word of caution, if the image is sheared, rotated, or otherwise warped, you may need to set the Threshold lower than you might expect.  This is the same page, same size, sheared slightly to the left.  The Threshold had to be lowered to 69% in order to extract the page.  Also note, warped images like these are where finer Angle Precision can help improve Extract Page's results.||[[image:1573835871816-103.png]]
|}


</tab>
<tab name="Step 4">
==== De-warp the image ====


After Extract Page finds the document's edges, it cuts the document out of the carrier image.  The image is then automatically de-warped.  Any skewing, shearing, or rotation is automatically fixed.  There are three "Interpolation Modes" to choose from: Linear, Cubic or NearestNeighbor.

[[file:1573836834194-409.png|center]]
 
 
This choice affects the speed at which the image is de-warped and the quality of the final image.  The better the quality of the final image, the slower the operation.  "Cubic" is the slowest but most accurate.  "NearestNeighbor" is the fastest but least accurate.  "Linear" falls between the two in both speed and accuracy, and is the default setting.
 
You can really see the difference between NearestNeighbor and the other two below.  If your document is fairly simple, like the one we've been using as an example, there likely won't be much difference between Linear and Cubic.  However, if your document has images, complicated table lines, or a lot of other detail, Cubic may be necessary if you want the most accurate de-warping.
 
 
{|style="text-align:center; margin:auto" cellspacing="10"
|-
|Cubic||Linear||Nearest Neighbor
|-
|[[file:1573838498967-406.png|border|center|300px]]||[[file:1573838502838-341.png|border|center|300px]]||[[file:1573838505844-197.png|border|center|300px]]
|}
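
To give a feel for why the modes differ in quality, the Python sketch below samples a single row of pixel values at a position that falls between two pixels, which is what de-warping has to do constantly.  It is a one-dimensional illustration with hypothetical function names, not Grooper's warp code, and it leaves out cubic interpolation for brevity.

```python
def sample_nearest(row, x):
    """Nearest-neighbor: take the value of the closest pixel."""
    index = min(len(row) - 1, int(x + 0.5))
    return row[index]


def sample_linear(row, x):
    """Linear: blend the two neighboring pixels by distance."""
    left = int(x)
    right = min(left + 1, len(row) - 1)
    frac = x - left
    return row[left] * (1 - frac) + row[right] * frac


# A made-up row of grayscale values forming a smooth ramp.
row = [0, 100, 200]

nearest_a = sample_nearest(row, 0.25)   # closest pixel is index 0
linear_a = sample_linear(row, 0.25)     # a quarter of the way to 100

nearest_b = sample_nearest(row, 1.75)   # closest pixel is index 2
linear_b = sample_linear(row, 1.75)     # three quarters of the way to 200
```

Nearest-neighbor only ever returns values that already exist in the row, so smooth ramps and straight lines come out stepped.  Linear returns in-between values, which is slower to compute but smoother.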
 
 
The last property available is "Apply Edge Smoothing".  When images are de-warped, pixels from lines that were straight on the original document are put back together and made straight again.  However, the operation is never 100% perfect.  Some distortion will necessarily occur.  As seen above, even the most accurate interpolation mode produces slightly jagged lines.  Edge smoothing attempts to even out these lines while preserving their edges.  This is a "True" or "False" property, so it is either on or off.  There are no further configurable properties to make it work "better".
 
 
[[file:1573839430384-312.png|center]]
 
 
</tab>
</tabs>
 
== Property Details ==
 
{|cellpadding=10 cellspacing=5 style="margin:auto"
|-style="background-color:#36B0A7; color:white"
|style="width:17%"|'''Property'''
|style="width:17%"|'''Default Value'''
|'''Information'''
|-style="background-color:#36B0A7; color:white"
|colspan=3|General Properties
|-style="background-color:#ddf5f5"
|Binarization||Auto||Binarization converts color and grayscale images to black and white by "thresholding" the image.  A threshold value is set on pixel intensity (a pixel's "lightness" or "brightness").  Pixels with an intensity above the threshold are converted to white, and those below it are converted to black.  The threshold can be set manually or found automatically.  The Thresholding Method can be set to one of four options:
 
*Simple - Thresholds an image to black and white using a fixed threshold value between 0 and 255.
*Auto - Selects a threshold value automatically using Otsu's Method.
*Adaptive - Thresholds pixels based on the intensity of pixels in the local neighborhood.
*Dynamic - Performs adaptive thresholding, while preserving dark areas on the page.
 
Each method has its own set of configurable properties.  For more information on binarization and these methods, visit the [[Binarize]] article.
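|-style="background-color:#ddf5f5"
|Border Size||0.25in||Sets the distance from each edge of the image that defines the region where edge detection is performed.  The resulting rectangle should fall inside the page you want to extract.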
|-style="background-color:#36B0A7; color:white"
|colspan=3|Line Detection Properties
|-style="background-color:#ddf5f5"
|Angle Precision||1/8||The precision at which edge angles are detected, ranging from 1/64 degrees to 1 degree.
|-style="background-color:#ddf5f5"
|Threshold||75%||"Threshold" sets the minimum length a detected line must have to count as an edge, expressed as a percentage of the image's width or height.  The default setting is 75%.  So, if the left (vertical) edge of the page you want to extract is at least 75% as long as the height of the whole image, that edge qualifies.  (Note: skewed, sheared or otherwise warped images might need a lower Threshold than you might expect.  The "Zoning" diagnostic image will help configure this property.)
|-style="background-color:#36B0A7; color:white"
|colspan=3|Warp Settings
|-style="background-color:#ddf5f5"
|Interpolation Mode||Linear||Determines the accuracy and speed of the warp operation.  NearestNeighbor is the fastest and least accurate.  Cubic is the most accurate but slowest.  Linear falls between the two.
|-style="background-color:#ddf5f5"
|Apply Edge Smoothing||False||When images are de-warped, pixels from lines that were straight on the original document are put back together and made straight again.  However, the operation is never 100% perfect.  Some distortion will necessarily occur.
 
Edge smoothing attempts to even out these lines while preserving their edges.  This is a "True" or "False" property, so it is either on or off.  There are no further configurable properties to make it work "better".
|}

Latest revision as of 14:28, 21 November 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Extract Page is an IP Command that removes an image from a carrier image while simultaneously removing any image warping or skewing.
