The following is a guest post by Carl Fleischhauer, a Digital Initiatives Project Manager in NDIIPP.
The Still Image Working Group within the Federal Agencies Digitization Guidelines Initiative (FADGI) recently posted a comparison of a few selected digital file formats. We sometimes call these target formats: the output formats you reformat to. In this case, we are comparing formats suitable for the digitization of historical and cultural materials that can be reproduced as still images, such as books and periodicals, maps, and photographic prints and negatives.
This activity runs in parallel with an effort in the Audio-Visual Working Group to compare target formats for video reformatting, planned for posting in the next few weeks. Meanwhile, there is a third activity pertaining to preservation strategies for born-digital video. The findings and reports from all three efforts will be linked from the format-compare page cited above.
The two comparisons of digitization formats employ similar, matrix-based tables to compare about forty features that are relevant to preservation planning, grouped under the following general headings:
- Sustainability Factors
- Cost Factors
- System Implementation Factors (Full Lifecycle)
- Settings and Capabilities (Quality and Functionality Factors)
The still image format-comparison is a joint effort of the Government Printing Office, the National Archives, and the Library of Congress. The initial posting compares JPEG 2000, “old” JPEG, TIFF, PNG, and PDF, along with several subtypes of each. In time, the findings from this project will be integrated into the Working Group’s continuing refinement of its general guideline for raster imaging.
Speaking for all of the compilers, I will note that we have varying levels of confidence about our findings, and we hope to benefit from the experience and wisdom of our colleagues. (The FADGI site includes a comment page. As I was drafting this blog, we received very helpful comments from colleagues at Harvard University.) The FADGI working group is not alone in parsing this topic. Members of the digital library community discuss the pros and cons of various still image target formats from time to time. During the first week of May this year, for example, there was a vivid exchange in the Digital Curation Google Group.
In this first blog of two, I’ll sketch a bit of background and offer some notes about the tried-and-true TIFF-file-with-uncompressed-picture-data. The second blog will offer some thoughts about JPEG 2000 (one motivation for the format comparison was to size up JPEG 2000) and also PNG. We are not aware of any preservation-oriented libraries or archives that employ PNG as their master target format. The absence of experience narratives for this particular application left us with only a moderate level of confidence in this part of our comparisons.
The “which format” question has two dimensions, although it is not clear that these are always carefully attended to. One aspect is the wrapper, what some would call the file format (although that is narrower than the definition provided in the FADGI glossary). TIFF is an archetypal example of a wrapper (you have a header and a handful of structural features), and it can contain a number of different picture-data encodings.
These days, the most frequently used encoding employed by memory institutions is uncompressed, barely an encoding at all. With uncompressed data, the raster (aka bitmapped) data is stored in a straightforward manner, one sample point after another in a grid. (The term raster connects back to the word rastrum, the name for a five-pointed pen used to draw music staff lines, a tool that resembles a rake and connects us to Latin radere, more or less to scratch or scrape.) Specialists call the sample points where the grid lines intersect picture elements, or pixels.
The values stored in the file on a pixel-by-pixel basis may represent grayscale or color information in varying degrees of precision, depending on how many bits are allocated to each pixel. An uncompressed data structure has one powerful strength: it is relatively transparent. It would not be difficult to build a tool to read the wrapper information and also unpack the rasterized data in order to present the image. To be sure, there is a correlative weakness: the lack of compression makes for big files.
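That transparency is easy to demonstrate. As a minimal sketch (mine, in Python, not part of the original post), the fixed eight-byte TIFF header can be parsed with nothing but the standard library: the first two bytes give the byte order, the next two the magic number 42, and the last four the offset of the first Image File Directory (IFD).

```python
import struct

def read_tiff_header(data: bytes):
    """Parse the fixed 8-byte TIFF header: byte order, the magic
    number 42, and the offset of the first Image File Directory (IFD)."""
    order = data[:2]
    if order == b"II":        # little-endian ("Intel" ordering)
        fmt = "<"
    elif order == b"MM":      # big-endian ("Motorola" ordering)
        fmt = ">"
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(fmt + "HI", data[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic number")
    return fmt, ifd_offset

# A hand-built little-endian header: "II", magic 42, first IFD at byte 8.
header = b"II" + struct.pack("<HI", 42, 8)
print(read_tiff_header(header))  # ('<', 8)
```

From the first IFD, a reader walks a short list of tagged fields (image width and length, bits per sample, strip offsets) and can then read the raster data directly, which is why uncompressed TIFF is considered so transparent.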
Uncompressed TIFF files consume a lot of storage space, and each time you summon one up, it takes a bit of time to read back from the storage media and travel through the network to your display device. Although not extensively used at the Library of Congress, TIFF does support the LZW compression algorithm, which will generally cut the size of a grayscale or color bitmap in half, with a corresponding decrease in transparency.
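The storage cost is easy to put numbers on. Here is a back-of-the-envelope sketch; the scan parameters are illustrative, not Library of Congress specifications:

```python
# Uncompressed size of an 8 x 10 inch print scanned at 400 ppi, 24-bit RGB.
width_px = 8 * 400            # 3200 pixels
height_px = 10 * 400          # 4000 pixels
bytes_per_pixel = 3           # 8 bits each for red, green, and blue
size_bytes = width_px * height_px * bytes_per_pixel
print(size_bytes)             # 38400000 bytes, roughly 37 MiB
# LZW would typically bring that down by about half for this kind of image.
```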
The TIFF wrapper specification was developed by the Aldus Corporation, with some Microsoft connections, in the 1980s, and moved to Adobe in the 1990s, more or less when Adobe bought Aldus. The most recent complete specification, version 6, dates from 1992. It is a very open and well-documented industry standard, i.e., not a capital-S standard from a standards developing body like the International Organization for Standardization (ISO). As the 1992 date indicates, TIFF is a little long in the tooth, although its endurance can be seen as a strength, especially considering the wide array of applications that can read it. Worth noting, however, is that the application array is not as deep as one might wish: TIFF files cannot be read natively in most browsers (you typically need a plug-in, but there are plenty around). Apple’s Safari is a notable exception.
Meanwhile, there are schools of thought about embedding metadata in digital files, and digital library folks sometimes debate what type, how much, and even whether it is a good idea to embed at all. (This writer is strongly in favor of embedding a “core” chunk, including an identifier that gives folks a bread-crumb trail back to, say, a bibliographic record or other metadata in a database.) The TIFF header can carry an identifier, although there are differences of opinion as to exactly where and how. But for those keen on what librarians call descriptive metadata, the native TIFF header is not so helpful. Many folks (especially professional photographers) solve the problem by using Adobe’s XMP specification (now an ISO standard) together with the IPTC metadata standard, but at the Library of Congress we have not yet taken the plunge.
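To make the identifier idea concrete, here is a hedged sketch (mine, not a FADGI recommendation) that writes and then reads back tag 270, ImageDescription, in a skeletal little-endian TIFF. The identifier string is hypothetical, a real master file would carry the full set of imaging tags, and note that the actual spec stores values of four bytes or fewer inline in the entry rather than at an offset.

```python
import struct

def build_tiff_with_description(text: str) -> bytes:
    """Assemble a skeletal little-endian TIFF whose only IFD entry is
    tag 270 (ImageDescription). Illustrative only: a real file would
    also carry width, length, bit depth, strip offsets, and so on."""
    payload = text.encode("ascii") + b"\x00"     # ASCII values are NUL-terminated
    value_offset = 8 + 2 + 12 + 4                # header + entry count + entry + next-IFD pointer
    header = b"II" + struct.pack("<HI", 42, 8)   # byte order, magic 42, IFD at byte 8
    ifd = struct.pack("<H", 1)                                   # one entry
    ifd += struct.pack("<HHII", 270, 2, len(payload), value_offset)  # tag, type ASCII, count, offset
    ifd += struct.pack("<I", 0)                                  # no further IFDs
    return header + ifd + payload

def read_description(data: bytes) -> str:
    """Walk the first IFD and return the ImageDescription string (tag 270)."""
    ifd_offset, = struct.unpack_from("<I", data, 4)
    count, = struct.unpack_from("<H", data, ifd_offset)
    for i in range(count):
        tag, typ, n, offset = struct.unpack_from("<HHII", data, ifd_offset + 2 + 12 * i)
        if tag == 270:
            return data[offset:offset + n - 1].decode("ascii")
    raise KeyError("no ImageDescription tag")

# Hypothetical identifier -- the "bread crumb" back to a catalog record.
tiff = build_tiff_with_description("urn:example:record-0001")
print(read_description(tiff))  # urn:example:record-0001
```

The same walk over the IFD is how any TIFF reader finds any tag, which is one reason opinions differ over *which* tag should carry the identifier rather than *whether* one can.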
Part Two of this series appeared on Thursday, May 15, 2014.