A General Segmentation Scheme for DjVu Document Compression

Abstract: We describe the “DjVu” technology: an efficient document image compression methodology, a file format, and a delivery platform that together enable instant access to high-quality documents from essentially any platform, over any connection. Originally developed for scanned color documents, it was recently extended to electronic documents, making DjVu a truly universal document interchange format. With DjVu, a color magazine page scanned at 300 dpi typically occupies between 40 KB and 80 KB, i.e., approximately 5 to 10 times smaller than JPEG for a similar level of readability (the typical compression ratio is 500:1). Converting electronic documents to DjVu also offers substantial advantages, as described in the paper. The technology relies on classifying each pixel as either foreground (text, drawings) or background (pictures, paper texture and color), thereby producing a segmentation into layers that are compressed separately. The novel contribution of this paper is a unified approach to the segmentation of scanned or electronic documents, grounded in the Minimum Description Length (MDL) principle. The foreground layer is compressed using a pattern matching technique that takes advantage of the similarities between character shapes. A progressive, wavelet-based compression technique, combined with a masking algorithm, is then used to compress the background image at lower resolution while minimizing the number of bits spent on pixels that are covered by foreground pixels. Encoders, decoders, and real-time, memory-efficient plug-ins for various web browsers are available for all major platforms.
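
As a rough sanity check on the figures quoted in the abstract, the sketch below computes the uncompressed size of a scanned page and the compression ratio implied by a 40 KB to 80 KB file. The page geometry (US Letter, 8.5 x 11 inches) and the 24-bit RGB pixel depth are assumptions made for illustration; the abstract states only the 300 dpi resolution and the file sizes. The function names are hypothetical.

```python
# Back-of-the-envelope check of the compression figures in the abstract.
# ASSUMPTIONS (not from the paper): US Letter page, 24-bit RGB, 1 KB = 1024 bytes.

def raw_size_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Uncompressed size of a scanned page in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel

def compression_ratio(raw_bytes, compressed_bytes):
    """How many times smaller the compressed file is than the raw image."""
    return raw_bytes / compressed_bytes

raw = raw_size_bytes(8.5, 11.0, 300)
print(f"raw page: {raw / 1e6:.1f} MB")

for size_kb in (40, 80):
    ratio = compression_ratio(raw, size_kb * 1024)
    print(f"{size_kb} KB file -> about {ratio:.0f}:1")
```

Under these assumptions the raw page is about 25 MB, and the 40 KB to 80 KB range corresponds to ratios of roughly 600:1 down to 300:1, which brackets the typical 500:1 quoted above. A different page size or pixel depth would shift these numbers.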

Patrick Haffner, Léon Bottou, Yann Le Cun and Luc Vincent: A General Segmentation Scheme for DjVu Document Compression, Proceedings of the International Symposium on Mathematical Morphology (ISMM'02), CSIRO publications, Sydney, Australia, April 2002.

ismm-2002.djvu ismm-2002.pdf ismm-2002.ps.gz

  author = {Haffner, Patrick and Bottou, L\'{e}on and {Le Cun}, Yann and Vincent, Luc},
  title = {A General Segmentation Scheme for {DjVu} Document Compression},
  booktitle = {Proceedings of the International Symposium on Mathematical Morphology (ISMM'02)},
  publisher = {CSIRO publications},
  address = {Sydney, Australia},
  month = {April},
  year = {2002},
  url = {http://leon.bottou.org/papers/haffner-2002},