Page Image Compression for Mass Digitization
 

In late 2006, Harvard University Library, the California Digital Library, the Internet Archive, and the Bibliothèque nationale de France conducted a collaborative investigation of the use of lossy JP2 compression for mass digitization of texts. We documented our findings in the IS&T Archiving 2007 Conference Proceedings.

We encourage you to consult the published paper, or this preprint, when using any of the following Harvard test suite images, or reviewing the evaluation reports.

Harvard University Library Test Suite
Text pages       b/w image
003176581_0007 003176581_0008 003298279_0001 006393844_0001 006051784_0002
Text + b/w illustration Color images  
003298279_0004 002010967_0026 002024214_0033 006393844_0008

Production notes for the Harvard test suite

The digitized pages in this suite were selected to represent a segment (but not the full range) of page characteristics for volumes published in the 19th and 20th centuries. This test suite contains nine book pages. Click on any thumbnail to access:

  • a one-page report summarizing evaluations of image quality for JP2 images of various sizes (from the Aware and Kakadu codecs)
  • three baseline images created by HCL Imaging Services:
    • a 300 ppi uncompressed 24-bit RGB TIFF "uncorrected camera" image, captured with a Zeutschel OS10000 Bookscanner
    • a 300 ppi uncompressed "processed RGB TIFF," created from the above image with an Adobe action script optimized for works on paper (from general library collections such as books and journals). The action script makes global and local color corrections and tonal adjustments through a combination of curves, levels, and hue/saturation controls. Images were also sharpened through a fairly complex multistep process.
    • a 600 ppi 1-bit TIFF, created by way of an intermediary 8-bit grayscale TIFF, which was derived from the "processed RGB TIFF" with Photoshop's default "convert to grayscale" function (30% Red, 60% Green, 10% Blue)
  • three sets of JP2 images, each produced from the same 300 ppi uncompressed processed RGB TIFF:
    • one lossless and six lossy JP2 files with the Aware version 3.11.2 command-line codec
    • one lossless and six lossy JP2 files with the Kakadu version 5.5.2 command-line codec
    • one lossless and six lossy JP2 files with the LuraWave version 2.1.11.05 command-line codec
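
The grayscale step in the baseline list above can be illustrated with a minimal sketch. It applies the 30% red, 60% green, 10% blue weighting quoted in the notes to 8-bit RGB pixels. The function names, the integer rounding, and the fixed threshold used to reach a 1-bit value are assumptions made for illustration; the notes do not say how the 1-bit TIFFs were thresholded or resampled to 600 ppi, and this is not the Photoshop implementation.

```python
# Channel weights quoted in the production notes, in percent.
RED_PCT, GREEN_PCT, BLUE_PCT = 30, 60, 10


def rgb_to_gray(pixel):
    """Map one 8-bit (R, G, B) tuple to an 8-bit gray value.

    Integer arithmetic with round-half-up; the rounding rule is an assumption.
    """
    r, g, b = pixel
    return (RED_PCT * r + GREEN_PCT * g + BLUE_PCT * b + 50) // 100


def gray_to_bilevel(gray, threshold=128):
    """Reduce a gray value to 1 bit (1 = white, 0 = black).

    The fixed threshold is hypothetical; the source does not state one.
    """
    return 1 if gray >= threshold else 0


def convert_row(row, threshold=128):
    """Convert a row of RGB pixels to 1-bit values."""
    return [gray_to_bilevel(rgb_to_gray(p), threshold) for p in row]
```

For example, convert_row([(255, 255, 255), (0, 0, 0)]) maps a white pixel and a black pixel to 1 and 0 respectively.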

The IS&T paper provides details of the peak signal-to-noise ratio (PSNR) and mean square error (MSE) functions we used, in Aware and Kakadu respectively, to optimize human-perceived quality of text and illustrations in lossy-compressed images. The LuraWave codec provides quality settings ranging from a low of Q1 to a high of Q100. We encourage you to consult the full paper for an explanation of our research questions, methodology, and findings.
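
For readers unfamiliar with the two measures, the following sketch shows the textbook definitions of MSE and PSNR for 8-bit samples. It is not the codecs' internal implementation, and the function names and the flat-list input format are assumptions made for illustration; how Aware and Kakadu apply these measures is described in the paper.

```python
import math


def mse(reference, test):
    """Mean square error between two equal-length sequences of sample values."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)


def psnr(reference, test, peak=255):
    """Peak signal-to-noise ratio in decibels; peak=255 assumes 8-bit samples."""
    error = mse(reference, test)
    if error == 0:
        return math.inf  # identical images: no noise
    return 10 * math.log10(peak ** 2 / error)
```

A lower MSE corresponds to a higher PSNR, so a lossless JP2 compared with its source TIFF gives an MSE of zero and an unbounded PSNR.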

May 2007
