Exporting Digital Format Sustainability Descriptions as XML

The following is a guest post by Carl Fleischhauer, a Digital Initiatives Project Manager in NDIIPP.

In 2003, we began drafting descriptions of digital formats, intended to support the Library’s preservation planning. Knowing that our descriptions would be of general interest, and wishing to work cooperatively with emerging format registries (e.g., the Unified Digital Format Registry), we soon began posting our descriptions online. Today the offering includes descriptions of 308 formats and subformats.

Learning XML, by Schönemann, on Flickr

In our nine years of operation, we have been gratified by the interest shown by other organizations and individuals. We get mentioned as a source here and there, for example in the format preservation page at the Binghamton University Libraries Web site, in various pages in the Archivematica Wiki and, more recently, in another Wiki from the energetic Let’s Solve the File Format Problem! project. From time to time, various writers have cited our analytic framework, for example in this article by the professional photographer Jeff Schewe.

At first, we created our descriptions in HyperText Markup Language. In 2007, we began to move toward eXtensible Markup Language as the drafting format. We planned to treat the XML versions as master copies and to produce the online HTML files via XSLT (Extensible Stylesheet Language Transformations, a language for writing stylesheets that reformat marked-up text or data). We started by converting our existing HTML into XML in a semi-automated process that nevertheless required a lot of hand editing. I blogged about the conversion process in March 2012: Formatting the Formats Pages.

Why bother with XML? As I wrote in 2012, XML markup can describe the different pieces of information using “tags” that convey the meaning of each chunk of text. Thus XML files can support a broader set of uses for the underlying data than HTML pages can. For example, an interested organization could take the XML versions of our documents and apply an XSLT stylesheet that recognizes the tags meaningful to that organization and uses them to extract selected segments. A format registry like the UDFR could extract particular elements from our dataset to supplement its own format-specific data.
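To make the idea of tag-based extraction concrete, here is a minimal sketch in Python rather than XSLT, using only the standard library. The sample document and its element names (formatDescription, sustainability, factor) are invented for illustration and are not taken from our actual schema; the function name extract_factors is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The element and attribute names are
# invented for illustration and are NOT the Library's actual schema.
SAMPLE = """\
<formatDescription id="example0001">
  <title>Example Image Format</title>
  <sustainability>
    <factor name="disclosure">Fully documented.</factor>
    <factor name="adoption">Widely adopted.</factor>
  </sustainability>
  <notes>Free text that this consumer does not need.</notes>
</formatDescription>
"""


def extract_factors(xml_text, wanted):
    """Return {factor name: text} for the factor names listed in `wanted`."""
    root = ET.fromstring(xml_text)
    result = {}
    # iter() walks the whole tree, so nesting depth does not matter here.
    for factor in root.iter("factor"):
        name = factor.get("name")
        if name in wanted:
            result[name] = (factor.text or "").strip()
    return result


# Pull out only the segment this (hypothetical) consumer cares about.
selected = extract_factors(SAMPLE, {"disclosure"})
print(selected)
```

Because the selection is driven by tag and attribute names, the same function works on any document that follows the same structure, which is the property that makes XML more reusable than presentation-oriented HTML.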

By the end of 2012, we had converted all the old descriptions to XML and had begun creating new ones in that format. With the help of our expert consultant Ignacio Garcia del Campo, we also refined the pair of XML Schema Definition files (.xsd extension) that we use. The refined versions carry the version number 1.0. There is a primary schema that uses an xsd:include declaration to reference a subsidiary schema that handles HTML styling within the longer text fields.
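For readers unfamiliar with xsd:include, the sketch below shows what such a declaration looks like and how a program might discover which subsidiary schema a primary schema depends on. The schema text, the file name text-styling.xsd, and the element name are all invented for illustration; only the xsd:include element, its schemaLocation attribute, and the XML Schema namespace come from the W3C standard.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C XML Schema standard.
XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical primary schema. The file name "text-styling.xsd" and the
# element name are invented; they are NOT the Library's actual schema.
PRIMARY = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:include schemaLocation="text-styling.xsd"/>
  <xsd:element name="formatDescription" type="xsd:string"/>
</xsd:schema>
"""


def included_schemas(xsd_text):
    """List the schemaLocation of every top-level xsd:include."""
    root = ET.fromstring(xsd_text)
    # ElementTree spells namespaced tags as "{namespace}localname".
    includes = root.findall(f"{{{XSD_NS}}}include")
    return [inc.get("schemaLocation") for inc in includes]


print(included_schemas(PRIMARY))
```

In practice this means that anyone validating against the primary schema needs the subsidiary file available at the referenced location, so the two files should be kept together.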

Now, in 2013, we are pleased to announce the availability of the XML versions as well as public access to the pair of XML schemas. We have added an introductory page to the site that provides links to the various resources, including a pointer to the ZIP file that contains the full set of XML versions of the Format Description Documents. There is also an instruction for those who seek a single instance and not the whole set.

We hope these XML versions will be useful to others. We are always eager to receive comments from our users. Send a note to help us correct errors or to suggest formats that we should describe. Although this activity is not a full-time job for any of us, we will do what we can to respond.