The “What” of Email Archiving

The Box That Done More, by delphwynd, on Flickr

A couple of weeks ago I wrote about the need for applied digital preservation research. The post generated a number of great comments, and over the next few months I’ll dig a little deeper into each subject area to tease out where the useful efforts are, while also identifying gaps that further research might plug.

This time I’ll dive a little deeper into email archiving. Perhaps it’s best to start with the first question from last time: What are the main challenges of email archiving?

At the highest level, Chris Prom, in his DPC Technology Watch Report on Preserving Email (pdf), identifies two areas: perceived technological barriers, and legal mandates that prioritize minimum legal retention periods, favoring record destruction over long-term access.

While the legal and policy questions are undoubtedly important, I’m going to leave them for another day and focus for now on the technical issues. And before we can address the “how” of email preservation we need to know “what” an email message is.

A strand of research addressing the “what” for any particular format comes under the rubric of “significant properties.” I won’t go into it any more deeply, but a particularly worthy introduction to this branch of research is “Significant properties of digital objects: definitions, applications, implications” (pdf) by Margaret Hedstrom and Cal Lee from 2002. (This path of exploration also leads to tools like JHOVE, JHOVE2 and DROID.)

So how do we know exactly what we need to capture when we’re talking about preserving an email message? Generally speaking, there is common conformance to the “Internet Message Format” syntax (RFC 2822, since superseded by RFC 5322) across mail systems, but there’s been little to no standardization on email storage formats within clients (look at Prom’s chapter on “IETF Standards” for a much richer discussion of these issues).
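
To make the “what” a little more concrete, here is a minimal sketch in Python of reading a message in the Internet Message Format with the standard library’s email package. The sample message, the addresses, and the describe_message helper are all made up for illustration, and the handful of fields it pulls out is only one possible view of what is worth capturing, not a definitive list of significant properties.

```python
from email import message_from_string
from email.policy import default

# Hypothetical sample message: headers, a blank line, then the body.
RAW_MESSAGE = (
    "From: Alice Example <alice@example.org>\n"
    "To: Bob Example <bob@example.org>\n"
    "Subject: Quarterly report\n"
    "Date: Mon, 02 Jan 2012 09:30:00 -0500\n"
    "Message-ID: <0001@example.org>\n"
    "\n"
    "Please find the figures below.\n"
)


def describe_message(raw):
    """Return a small dict of properties parsed from one raw message.

    The choice of fields is an assumption made for this sketch.
    """
    msg = message_from_string(raw, policy=default)
    # Prefer a plain-text body; get_body returns None if there is none.
    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": str(msg["From"]),
        "to": str(msg["To"]),
        "subject": str(msg["Subject"]),
        "date": str(msg["Date"]),
        "message_id": str(msg["Message-ID"]),
        "body": body.get_content() if body is not None else "",
        "attachment_count": sum(1 for _ in msg.iter_attachments()),
    }


if __name__ == "__main__":
    for name, value in describe_message(RAW_MESSAGE).items():
        print(f"{name}: {value!r}")
```

Even this toy example shows that a message is a set of structured headers plus a body (and possibly attachments), which is roughly the level at which significant-properties work tends to operate.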

"Digital preservation buttons" by user wlef70 on Flickr

Gareth Knight looked at email messages through the lens of “significant properties” research in the 2009 “Significant Properties Testing Report: Electronic Mail,” part of the InSPECT series of testing reports on different kinds of electronic content (the folks at Archivematica have helpfully summarized the information in Knight’s report in a concise table).

Email preservation is certainly context-based: a preservationista needs to understand the email client(s) used in the organization and home in on the format native to each one. There are a lot! One web site lists approximately 60 email-related file extensions, but Knight narrows it down to five prominent “representation formats” (Microsoft Outlook Message, Microsoft Outlook Personal Folder, mbox, Maildir and the Email Account XML schema), while offering more detail:

Representation formats are interpreted by the type of information that they contain, as opposed to any characteristic of the format specification itself. An email may be stored in any format that allows the storage of text-based information, as text (ASCII, Unicode) and binary encoded data (Microsoft Personal Folders). Variation of each encoding type is identified by the organisational structure and mark-up contained. For example, mail may be stored individually using maildir or EML, or as a combination of one or more emails in a single file using mboxrd, mboxcl, or other variations.

Prom is even more parsimonious, winnowing his list of formats to mbox and EML, with a caveat:

In the case of many proprietary clients, messages cannot be exported from their native system directly into MBOX or EML. Instead, these clients may export the message to a proprietary, though perhaps open, format. The most common of these formats are .pst (Outlook), and .nsf (Lotus). Tools … can then convert these files to MBOX or EML.

Still, Prom notes that “in general, if an institution can get email into one of the MBOX or EML formats, it has taken a very big step on the road toward preserving email.”
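
As a rough illustration of how close mbox and EML are to each other, the following Python sketch splits an mbox file into one EML file per message using the standard library’s mailbox module. The function name mbox_to_eml and the numbered file naming are assumptions made for this example; it is not one of the tools Prom describes, and it does not handle proprietary formats such as .pst or .nsf.

```python
import mailbox
from pathlib import Path


def mbox_to_eml(mbox_path, out_dir):
    """Write each message in an mbox file to its own .eml file.

    Returns the list of paths written. File names are simple sequence
    numbers, which is a choice made for this sketch only.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # create=False: fail rather than silently create an empty mailbox.
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for index, key in enumerate(box.keys(), start=1):
            target = out / f"{index:06d}.eml"
            # get_bytes returns the message without its "From " separator line.
            target.write_bytes(box.get_bytes(key))
            written.append(target)
    finally:
        box.close()
    return written


if __name__ == "__main__":
    import sys

    for path in mbox_to_eml(sys.argv[1], sys.argv[2]):
        print(path)
```

One caveat: Python’s mailbox module treats the file as the original “mboxo” variant and does not reverse “>From ” escaping in message bodies when reading, so a careful workflow would need to account for the mbox variant actually in use.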

Success!

In a future post I’ll take a look at some of the technical approaches being explored to address the “how” of preserving email. Here are some of the most prominent approaches:

  • Migrate email to a new version of the software or an open standard
  • Wrap email in XML formats
  • Emulate the email environment
  • Retain the messages within the existing email system
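
To give a flavor of the second approach in the list above, here is a small Python sketch that wraps a single message in XML using only the standard library. The element names (Message, Headers, Header, Body) are invented for this example and do not follow the Email Account XML schema or any other published schema; a real workflow would also need to handle attachments and characters that are not legal in XML.

```python
import xml.etree.ElementTree as ET
from email import message_from_string
from email.policy import default


def wrap_message_in_xml(raw):
    """Return an XML string holding the headers and plain-text body.

    Element and attribute names are hypothetical, chosen for this sketch.
    """
    msg = message_from_string(raw, policy=default)
    root = ET.Element("Message")
    headers = ET.SubElement(root, "Headers")
    # Keep every header in its original order, including repeated ones.
    for name, value in msg.items():
        header = ET.SubElement(headers, "Header", {"name": name})
        header.text = str(value)
    body = msg.get_body(preferencelist=("plain",))
    ET.SubElement(root, "Body").text = (
        body.get_content() if body is not None else ""
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "From: alice@example.org\n"
        "Subject: Hello\n"
        "\n"
        "A short note.\n"
    )
    print(wrap_message_in_xml(sample))
```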

The Open Planets Foundation has a wiki tracking email-related technical projects, some of which we’ll highlight in future posts. Leads on projects that have explored any of these approaches are greatly appreciated!