Recently, we’ve started to add email formats to the Sustainability of Digital Formats website. Eventually, when we get a more robust collection, we’d like to split them out into a separate content category but for now, they (mostly) are categorized with their closest cousin, the Textual Content family.
Our genealogical research is still very much underway, but let’s explore what we’ve documented so far of the email file format family tree. As explained in the Preserving Email report (PDF) from the Digital Preservation Coalition, the exchange of email messages relies on sustained interoperability, so standardization is essential. A foundational component is the Internet Message Format (IMF), which defines the syntax for email message bitstreams sent between computer users. IMF, for example, dictates how the to, from and subject fields in the header are to be structured while the message is passed along the chain from author to recipient so that all the systems interpret the data in the same way.
All email messages, regardless of the email client or Mail User Agent used to author or read them, conform to the IMF syntax while in transit. It’s the key to the whole email exchange system. There’s not a lot of wiggle room in IMF and that is intentional to maintain interoperability across the disparate computer systems that create, move and store email messages. Some formats, like EML (discussed below), retain the IMF structure even when stored in a mailbox. Other formats are mapped or transformed into IMF while the message is in transit depending on the requirements of the originating and receiving MUA.
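To make the header syntax a little more concrete, here is a minimal sketch of reading the from, to and subject fields out of an IMF-style message with Python’s standard email package. The sample message, the addresses and the summarize_headers name are our own illustrative assumptions, not anything defined by IMF itself.

```python
from email import message_from_string
from email.utils import parseaddr

# A tiny, made-up message in IMF syntax: header fields, a blank line, then the body.
RAW_MESSAGE = (
    "From: Alice Example <alice@example.org>\r\n"
    "To: Bob Example <bob@example.org>\r\n"
    "Subject: Family tree\r\n"
    "Date: Mon, 01 Jan 2024 12:00:00 +0000\r\n"
    "\r\n"
    "Hello Bob,\r\n"
    "Here is the draft.\r\n"
)


def summarize_headers(raw):
    """Return the bare addresses and subject from an IMF-style message string."""
    msg = message_from_string(raw)
    return {
        # parseaddr splits "Display Name <address>" into (name, address).
        "from": parseaddr(msg["From"])[1],
        "to": parseaddr(msg["To"])[1],
        "subject": msg["Subject"],
    }


if __name__ == "__main__":
    print(summarize_headers(RAW_MESSAGE))
```

Because every system in the chain agrees on this structure, any conforming parser should be able to pull out the same three values from the same bitstream.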
If we think of IMF as the trunk of our family tree, we can think of other formats as branches off this common central structure. IMF is primarily concerned with messages in transit. The other formats on our tree define how messages and other objects are stored in mailboxes.
The largest branch on our tree so far is the MBOX email format family. Like many families, the four variants of MBOX (MBOXO, MBOXRD, MBOXCL and MBOXCL2) have some things in common but also display some differences. All MBOX variants share two traits. First, the MBOX family concatenates all messages stored within a folder into a single file. Second, individual messages within the single file all begin with a “From ” line, continue with a series of non-“From ” lines, and end with a blank line. A “From ” line means any line in the message or header that begins with the five characters ‘F’, ‘r’, ‘o’, ‘m’, and ‘ ’ (space). The family differences kick in when determining how best to identify the end of one message and the start of the next message within the concatenated file. MBOXO, MBOXRD and MBOXCL use “From ” line quoting. MBOXCL and MBOXCL2 rely to different extents on a Content-Length field in the header that documents each message’s length in bytes.
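The quoting idea can be sketched in a few lines of Python. As we understand the commonly described conventions, MBOXO prefixes a ‘>’ only to body lines that begin with “From ”, while MBOXRD also adds a ‘>’ to lines that are already quoted so that the change can be undone exactly. The function names below are hypothetical, and this is an illustration of the idea rather than a reference implementation of either variant.

```python
import re

# Body lines that begin with "From " (MBOXO-style quoting target).
_FROM_LINE = re.compile(r"^From ", re.MULTILINE)
# Body lines that begin with zero or more '>' followed by "From " (MBOXRD-style target).
_RD_QUOTE = re.compile(r"^(>*From )", re.MULTILINE)
# Lines carrying at least one '>' in front of "From " (MBOXRD-style unquoting target).
_RD_UNQUOTE = re.compile(r"^>(>*From )", re.MULTILINE)


def quote_mboxo(body):
    """Quote only bare "From " lines; already-quoted lines are left alone."""
    return _FROM_LINE.sub(">From ", body)


def quote_mboxrd(body):
    """Add one '>' to every line matching >*From so quoting is reversible."""
    return _RD_QUOTE.sub(r">\1", body)


def unquote_mboxrd(body):
    """Strip exactly one '>' from every line matching >+From ."""
    return _RD_UNQUOTE.sub(r"\1", body)


if __name__ == "__main__":
    body = "Hi\nFrom here on it gets tricky\n>From an earlier reply\n"
    print(quote_mboxo(body))
    print(quote_mboxrd(body))
    assert unquote_mboxrd(quote_mboxrd(body)) == body
```

In the MBOXO-style output, a reader can no longer tell which “>From ” line was quoted by the mailbox and which was typed by the author, which is the ambiguity the reversible scheme is meant to avoid.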
Another branch on our email family tree is EML, short for electronic mail or email. EML, the default format for Microsoft Outlook Express, is a direct descendant of IMF. An EML file typically stores a single message (unlike its cousin MBOX, which concatenates all the messages), and attachments may either be included as MIME content in the message or written out as a separate file, referenced from a marker in the EML file.
MBOX and EML are commonly used as normalization formats in many preservation workflows because most modern email clients and servers can import and export one or both of the formats. A previous blog post detailed some of the issues with preserving email formats and included details about using software programs to convert between the two formats and numerous proprietary formats. Once in an MBOX or EML format, the data can be parsed into XML using standardized schemas.
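For readers curious what such a conversion can look like, here is a minimal sketch that splits one MBOX file into one EML file per message using Python’s standard mailbox module. The mbox_to_eml name and the message-00001.eml naming pattern are our own assumptions, and the sketch does not attempt to undo any “From ” line quoting or to handle the Content-Length based variants.

```python
import mailbox
from pathlib import Path


def mbox_to_eml(mbox_path, out_dir):
    """Write each message in an MBOX file to its own .eml file; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # create=False: fail rather than silently create an empty mailbox.
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for index, key in enumerate(box.iterkeys(), start=1):
            target = out / f"message-{index:05d}.eml"
            # get_bytes() returns the message without its leading "From " separator line.
            target.write_bytes(box.get_bytes(key))
            written.append(target)
    finally:
        box.close()
    return written


if __name__ == "__main__":
    import sys

    for path in mbox_to_eml(sys.argv[1], sys.argv[2]):
        print(path)
```

A production workflow would also want to record checksums and keep the original mailbox, since a conversion like this is only as faithful as the parser behind it.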
A little further out on our family tree are the two formats of PST, Microsoft’s Personal Folders Format. The two versions of PST, PST_ANSI and PST_Unicode, are differentiated primarily by software implementation versions, character sets, maximum file size constraints and bit values.
Finally, there’s Microsoft Outlook Item (MSG) and its generic content type parent, Compound File Binary File Format (CFB). Who would have thought that some email messages and one of the formats used for complex audiovisual objects share a common ancestor? Microsoft’s CFB, which isn’t limited to email formats, implements a simplified file system through a hierarchical collection of storage objects and stream objects. The structured storage profile of CFB formed the basis for the Advanced Authoring Format (AAF) specification, a format intended to support the interchange of video materials between editing systems. The MSG format features a syntax for storing what the specification calls a single Message object, such as an email or an appointment, in a file. MSG files contain all the properties of the message including the text and any attachments.
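One practical consequence of this shared ancestry is that an MSG file starts with the same eight-byte signature as any other CFB file. The sketch below checks for that signature; the looks_like_cfb name is our own, and a match only suggests a CFB container, not that the file is specifically an MSG.

```python
# The eight-byte header signature that opens a Compound File Binary file.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_cfb(data: bytes) -> bool:
    """Return True if the data begins with the CFB header signature."""
    return data[:8] == CFB_SIGNATURE


def file_looks_like_cfb(path) -> bool:
    """Read only the first eight bytes of a file and test them."""
    with open(path, "rb") as handle:
        return looks_like_cfb(handle.read(8))
```

Telling an MSG apart from its CFB relatives would require looking inside the container at the storages and streams it holds, which is beyond this small example.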
Our email family tree will continue to grow and mature. The Library’s Manuscript Division staff, for example, have encountered a few other formats as they process personal papers collections, like the recently acquired papers of Senator Joseph Lieberman.