In June I did a post highlighting segments of the digital stewardship universe that could use applied research attention. I looked at the “what” of email archiving here and the “how” of email archiving here and now I turn my attention to format migration.
The need to migrate file formats arises out of concerns about format obsolescence. As I mentioned in my original post, there are ongoing discussions about how acute the format obsolescence problem might be, but for the purpose of this post we’re going to assume that migration is a possible solution to digital stewardship challenges and concentrate on useful resources that support the activity.
In my original post I proposed a series of largely technical questions that a researcher might ask regarding format migration, mostly about what happens to files and the information they contain in a migration process. This time around we’ll look at the infrastructure needed to do format migration and in a future post we’ll look at the results of a few migration experiments.
The first piece of the infrastructure is the format registry. Format registries, such as PRONOM, developed by the UK National Archives, and the Unified Digital Format Registry, developed by the University of California, provide detailed documentation about data file formats and their supporting software products. The format registries are important because we need to know as much as possible about the documented state of a format before we can understand what changes take place in a transformation.
[And while it's not a format registry, the Library's Sustainability of Digital Formats site has a lot of useful information in this area.]
Next come the tools that draw on the registry information to support the automated identification of file formats. Some interesting tools include FIDO, the Format Identification for Digital Objects Python command-line tool; DROID, the Digital Record Object Identification tool; and JHOVE and JHOVE2. Each of these tools supports file format identification, validation and characterization to varying degrees, though I’m not qualified to discuss their significant differences (I’ll let the developers point them out in the comments!).
They’re all similarly interesting for our purposes in that they allow the “identification” process to be incorporated into automated workflows along with a suite of other identification/characterization/migration/evaluation tools.
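To make the identification step a little more concrete, here is a minimal Python sketch of signature-based identification, the general idea behind matching a file's leading bytes against known patterns. It is not how FIDO, DROID or JHOVE are actually implemented, and it does not call any of them. The `SIGNATURES` table and the `identify` and `identify_tree` functions are my own illustrative names, and the table is a tiny hand-written stand-in for the much richer signature information a registry holds.

```python
from pathlib import Path

# Hand-written table for illustration only. Real registries describe
# signatures with offsets, byte ranges and format versions; this sketch
# only checks a fixed prefix at the start of the file.
SIGNATURES = [
    (b"%PDF-", "Portable Document Format"),
    (b"\x89PNG\r\n\x1a\n", "Portable Network Graphics"),
    (b"GIF87a", "Graphics Interchange Format (87a)"),
    (b"GIF89a", "Graphics Interchange Format (89a)"),
    (b"\xff\xd8\xff", "JPEG"),
]


def identify(path):
    """Return a format name based on the file's leading bytes, or 'unknown'."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_tree(root):
    """Map every file under root to the format name identify() reports."""
    return {
        str(p): identify(p)
        for p in sorted(Path(root).rglob("*"))
        if p.is_file()
    }
```

Calling `identify_tree` on a directory returns a dictionary of path-to-format pairs, which is the kind of output a workflow could hand to a later characterization or migration step. A prefix match like this says nothing about whether a file is valid, which is part of why the real tools go further.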
The next thing you need is files to migrate. I’m sure you’ve got plenty of your own, but if you’re working at scale you may want to access a large corpus of data such as the one provided by BioMed Central. The Planets testbed was a very effective research environment hosted by the European Planets project to facilitate practical experimentation in digital stewardship and to enable users to repeat experiments in order to validate the results, but I’m still trying to clarify its current status. The successor to Planets, the Open Planets Foundation, does maintain a Formats Corpus.
On a side note, the National Software Reference Library has a research computing environment containing some 18,000,000 unique original files, along with a database containing metadata about the files. Researchers can run an algorithm against the file collection by submitting a job (in code form) to the NSRL, which runs it for them.
Last but not least you need software tools to do the migrations. Here is where it starts to get complicated. A great place to start is the work being done by SCAPE, the SCAlable Preservation Environments project funded by the European Union and coordinated by the Austrian Institute of Technology. They’ve authored a report that looks at what they call “preservation action tools” developed under the Planets, CRiB and RODA projects. The paper introduces models for assessing the appropriateness of any particular piece of software for preservation migration purposes.
Another useful site is the Conversion Software Registry maintained by the Image and Spatial Data Analysis Group at the University of Illinois at Urbana-Champaign National Center for Supercomputing Applications. The registry is a repository of information about software packages that are capable of file format conversions, and it includes tools to help identify conversion paths between formats.
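The idea of a "conversion path" can be sketched as a search over a graph in which formats are nodes and each known conversion is an edge. The Python below is a minimal illustration using breadth-first search. It is not the Conversion Software Registry's code or data: the `CONVERSIONS` table, the placeholder tool names and the `find_path` function are all hypothetical.

```python
from collections import deque

# Hypothetical conversion table: (source format, target format) -> tool.
# The tool names are placeholders, not real software packages.
CONVERSIONS = {
    ("doc", "odt"): "tool-a",
    ("odt", "pdf"): "tool-b",
    ("doc", "rtf"): "tool-c",
    ("rtf", "pdf"): "tool-d",
    ("tiff", "png"): "tool-e",
}


def find_path(source, target, conversions=CONVERSIONS):
    """Return the shortest chain of (from, to, tool) steps, or None."""
    if source == target:
        return []
    # Build an adjacency list: format -> [(reachable format, tool), ...]
    graph = {}
    for (src, dst), tool in conversions.items():
        graph.setdefault(src, []).append((dst, tool))
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, steps = queue.popleft()
        for nxt, tool in graph.get(fmt, []):
            if nxt in seen:
                continue
            path = steps + [(fmt, nxt, tool)]
            if nxt == target:
                return path
            seen.add(nxt)
            queue.append((nxt, path))
    return None  # no chain of known conversions reaches the target
```

With this table, `find_path("doc", "pdf")` returns a two-step chain, while `find_path("tiff", "pdf")` returns `None` because no listed conversion leads from either TIFF or PNG to PDF. Breadth-first search finds a path with the fewest hops, which seems like a reasonable default given that every extra conversion is another chance for information to change, though a real assessment would also weigh the quality of each tool.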
There are proprietary tools already used in some domains (such as the geospatial community) that support the mass transformation of data across multiple formats, but they’re designed more to support the movement of data between databases and applications. It’s not clear to what degree (if any) they’ve considered preservation as a significant use for their tools, but it’s an area for future exploration.
In a future post we’ll take a closer look at the outputs from some migration efforts. Feel free to identify experiments or other migration tools and services in the comments.