Hey Content Creator: Make Mine Lossless!

I’ve always loved the term “lossy” compression (add a “y” to anything and the “cute” factor really goes up). But just like a baby tiger is cute only so long as you understand that it will one day grow into a vicious, man-eating beast, lossy compression is cute only so long as you understand that it may someday come back and bite you if you’re thinking about long-term preservation.

Digital Compression by user spacepleb on Flickr

That sounds a bit hyperbolic, so let me step back a bit. In 2011 I wrote about IDOM, four simple steps to help you start thinking about how to preserve your own digital materials (for the record, it’s Identify, Decide, Organize and Make copies). One undeniable factor in “make copies” is that there’s a trade-off everyone has to make between quality and affordability.

We all want to store our digital data at the highest quality possible, but higher quality generally means larger file sizes, which means more storage, which means more money. Compressed data, generally speaking, takes up less physical storage space and moves more easily over networks. The file size difference can be dramatic.

Let’s say you wanted to rip your CD collection and store it as high-quality WAVE files on an external hard drive. A digital file that holds a typical three-minute song on a CD is 30–40 megabytes in size, so an average CD would be around 450 megabytes. If you had 1000 CDs in your collection you’d need about half a terabyte of storage. Things aren’t so bad these days, cost-wise: half a terabyte would only run you about $40 (10 years ago it would have run you almost $1200).

vinyl kills the mp3 industry by user karola on Flickr

Now let’s say you wanted to save storage space by compressing the audio. MPEG Layer III audio encoding (MP3 for short) typically reduces the file size of a song by an order of magnitude. So that half a terabyte would shrink to around 50 gigabytes, roughly a tenth of the storage to pay for (prices for external hard drives fluctuate quite a bit, so don’t hold me to these prices!).
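For anyone who wants to check the math, here is a rough Python sketch of the estimates above. The CD audio parameters (44.1 kHz, 16-bit, stereo) are the standard ones; the 128 kbit/s MP3 bit rate is my assumption for illustration, and real MP3s vary quite a bit.

```python
# Back-of-the-envelope storage estimates for a ripped CD collection.
# Assumptions (not from the article): CD audio is 44.1 kHz, 16-bit, stereo,
# and the MP3 copy is encoded at a constant 128 kbit/s.

CD_SAMPLE_RATE_HZ = 44_100
CD_BYTES_PER_SAMPLE = 2      # 16 bits per sample
CD_CHANNELS = 2              # stereo
MP3_BITRATE_BPS = 128_000    # assumed bit rate, in bits per second


def wav_bytes(seconds):
    """Size of uncompressed CD-quality audio (file header ignored)."""
    return seconds * CD_SAMPLE_RATE_HZ * CD_BYTES_PER_SAMPLE * CD_CHANNELS


def mp3_bytes(seconds, bitrate_bps=MP3_BITRATE_BPS):
    """Approximate size of a constant-bit-rate MP3 of the same length."""
    return seconds * bitrate_bps // 8


song_seconds = 3 * 60
wav_mb = wav_bytes(song_seconds) / 1_000_000
mp3_mb = mp3_bytes(song_seconds) / 1_000_000
ratio = wav_bytes(song_seconds) / mp3_bytes(song_seconds)

print(f"3-minute song as WAVE: {wav_mb:.1f} MB")
print(f"3-minute song as MP3:  {mp3_mb:.1f} MB")
print(f"Reduction factor:      {ratio:.1f}x")

# Scale up using the article's figure of roughly 450 MB per CD.
collection_gb = 1000 * 450 / 1000
print(f"1000 CDs as WAVE: {collection_gb:.0f} GB")
print(f"1000 CDs as MP3:  {collection_gb / ratio:.0f} GB")
```

Under those assumptions a three-minute song lands at the low end of the 30–40 megabyte range, and the MP3 copy is roughly a tenth of that size.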

However, when we’re thinking about preserving digital information we generally want to avoid compressing the data, unless we can compress it “losslessly.” “Lossless” compression means that we can shrink a piece of digital content, but we can also restore it to exactly its original form, bit for bit, without losing any information in the transformation process.
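To make the “lossless” idea concrete, here is a minimal sketch using Python’s built-in zlib module. The sample data is made up for illustration, and zlib is a general-purpose compressor rather than an audio or image format, but the round trip works the same way.

```python
import zlib

# Made-up sample data: repetitive text compresses well.
original = b"Make mine lossless! " * 500

# Compress at the highest compression level, then decompress.
compressed = zlib.compress(original, 9)
restored = zlib.decompress(compressed)

print(f"Original size:   {len(original)} bytes")
print(f"Compressed size: {len(compressed)} bytes")

# The defining property of lossless compression:
# the restored data is identical to the original, byte for byte.
print("Round trip is exact:", restored == original)
```

The compressed copy is much smaller here only because the sample is so repetitive; how much any given file shrinks depends on its content.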

“Lossy” compression, on the other hand, is a data encoding method that compresses data by removing part of it. Different compression schemes apply different algorithms to decide which data to discard while keeping the result within an acceptable level of quality as determined by the user’s needs, but there’s no getting around the fact that once the data is discarded under a “lossy” compression scheme it’s gone for good.

While institutions (and individuals) want to save on costs as much as possible, we all want to retain as much of the utility of the information as we possibly can. We have no idea how much storage or bandwidth will cost in the future (hopefully less) nor do we know what future users might do with current data (undoubtedly many interesting things), but we’re pretty sure we want to keep our options open.

An MP3 is an example of lossy compression. If you compress that original WAVE file using the MP3 compression scheme, the information you remove to decrease the file size is gone for good and you can’t bring it back. It is possible to convert your MP3 back to a WAVE file using available software tools, but all you’ll have is a mediocre WAVE file. The original information is gone and, particularly at lower bit rates, you can often hear the difference.
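Here is a toy illustration of why the round trip fails. To be clear, this is not how MP3 works (MP3 uses a far more sophisticated perceptual model); the sketch simply throws away the low 8 bits of each 16-bit sample, which is an assumption I picked because it is easy to follow. The function names are hypothetical.

```python
import math

# A made-up "recording": 441 samples of a 440 Hz tone at 44.1 kHz,
# stored as signed 16-bit integers.
original = [round(20_000 * math.sin(2 * math.pi * 440 * n / 44_100))
            for n in range(441)]


def lossy_encode(samples, drop_bits=8):
    """Toy lossy encoder: discard the low bits of every sample."""
    return [s >> drop_bits for s in samples]


def lossy_decode(codes, drop_bits=8):
    """Expand back to the original scale; the discarded bits stay lost."""
    return [c << drop_bits for c in codes]


restored = lossy_decode(lossy_encode(original))
max_error = max(abs(a - b) for a, b in zip(original, restored))

# Same number of samples, so the "restored" file is full size again...
print("Same length:", len(restored) == len(original))
# ...but the content no longer matches the original.
print("Identical to original:", restored == original)
print("Largest per-sample error:", max_error)
```

The decoded version takes up just as much room as the original, but nothing in the decoder can recover the bits the encoder discarded.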

So if you want to preserve an audio file for the long term you either need to keep it in its original format or use a compression scheme that allows you to roll back your compressed file to its original form.

There are a number of lossless compression schemes for audio (FLAC and Apple Lossless, for example), though they’re not supported equally by the major digital media players.

The same holds true for photographs. For example, let’s look at my “butch dogg” picture from the IDOM article.

This image is stored in Joint Photographic Experts Group (JPEG) format, which is a compressed format. Sadly, JPEG is a form of “lossy” compression.

Of course, a large amount of data can be discarded before the result is sufficiently degraded to be noticed by the user, but it’s the same situation as the audio described above. Had I been thinking long-term I might have made a different decision on the final-state format for my photo.

If planning these things out from the start, it’s most advantageous to start with a high-resolution master lossless file that can then be used to produce compressed files for different purposes; for example, a multi-megabyte file can be used at full size to produce a full-page advertisement in a glossy magazine while a smaller, lossy copy can be made for a small image on a web page.

A consideration of lossy vs. lossless compression is just one factor in identifying sustainable stewardship practices, but it’s an important one to consider, especially at the start of a digital workflow. The Still Image Working Group of the Federal Agencies Digitization Guidelines Initiative has been exploring these issues in great depth.

Consensus is still developing on the most sustainable preservation master formats (see recommendations from NARA, the American Society of Media Photographers and others) but compression is certainly one of the big issues to consider.

The stewardship community will undoubtedly spend plenty of time managing and preserving lossy files (huge numbers of JPEGs and MP3 files are already out there), but if you’ve got the option, make yours lossless!