Where is the Applied Digital Preservation Research?


A few months back, during the Personal Digital Archiving 2013 conference, I was struck by how much interesting research was being done in the field of digital preservation. Everything from digital forensics to gamification, all of it thoughtful, much of it very practical and applicable. Still, I couldn’t help wishing that there was even more going on.

In NDIIPP we often interact with granting organizations and get a peek at the types of things proposers are hoping to get funded. While many useful things are proposed and get funded, I’m struck more by the types of things that I don’t see as often: proposals for practical, applied research that directly address long-standing digital stewardship challenges or that build on other stellar research to establish a focused advance towards solutions. Many of the issues that need more focus are the types of things that cause organizations to wait on digital stewardship because the problems aren’t solved yet.

So I started writing down a list of things that might merit further attention from researchers and funders. I haven’t done an exhaustive search to see what’s currently being done in these areas (please point things out in the comments!) nor have I thought through all the challenges of doing these types of research (that’s for the researchers!) but I do think these merit further attention.

My inspiration for encouraging applied research is the work NDIIPP did back in 2005 with the Archive Ingest and Handling Test project. The AIHT was designed to test the interfaces specified in the architectural model for NDIIPP. The researchers ended up discovering that “even seemingly simple events such as the transfer of an archive are fraught with low-level problems, problems that are in the main related to differing institutional cultures and expectations” (from its final report (PDF)).

The observations that came out of these discoveries, rather than being irritating sidebars to the “real research,” actually provide ample practical value to future researchers engaged in similar digital preservation activities.

The GeoMAPP project took a similar approach to surfacing unexpected results by having the participants transfer their geospatial data collections back and forth between the different states, exposing each to new approaches and to the challenge of “last mile” transfer, storage and network infrastructures.

This is the kind of unexpected knowledge that can come out of applied research, the kinds of efforts that might be applied to some of the areas below:

Format Migration: What happens to any particular file when you migrate the file from one version of software to another? What happens when you migrate from one software type to another, for example, converting files from one type of word processing software to another? What changes happen to the file and the information inside, and can these changes be quantified and measured? How can we determine whether the changes have any import for digital preservation actions? Is it possible to do all of this at scale and be able to manage the changes in a coherent way?

Format obsolescence, and the need to address it in the future, has become a truism in the digital stewardship community. While it may be a vexing problem, there is still doubt about how acute the problem might be. Still, we’ll need answers to the questions above in order to determine whether addressing format obsolescence through migration is worth the cost of doing so.
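To make the "can these changes be quantified?" question a little more concrete, here is a minimal sketch in Python. It assumes you have already extracted plain text from a document before and after migration (the extraction step is not shown and would depend on the formats involved), and it only measures line-level textual change. The function name migration_change_report is hypothetical, and this is one possible measure among many, not an established preservation metric. Layout, fonts, embedded objects and metadata would all need their own comparisons.

```python
import difflib


def migration_change_report(before_text, after_text):
    """Compare text extracted before and after a migration, line by line.

    Returns a similarity ratio between 0.0 and 1.0 plus a count of lines
    affected by each kind of change. Only textual content is compared.
    """
    before_lines = before_text.splitlines()
    after_lines = after_text.splitlines()

    # autojunk=False disables the heuristic that ignores very frequent lines,
    # so repeated boilerplate lines are still compared.
    matcher = difflib.SequenceMatcher(a=before_lines, b=after_lines, autojunk=False)

    counts = {"replace": 0, "delete": 0, "insert": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Count the larger side of the changed block as the lines affected.
        counts[tag] += max(i2 - i1, j2 - j1)

    return {"similarity": matcher.ratio(), "lines_changed": counts}
```

Run across a large sample of migrated files, a report like this could at least show which migrations alter content and by how much, which is the kind of basic data the questions above are asking for.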

Fixity Checking: How often do we need to check the fixity value of any particular digital file to ensure that it remains the same? Is there a risk in touching files too much? Is there an optimal amount of contact that will ensure authenticity while limiting risk and cost? Will regular fixity checking give us more accurate error rates for different types of digital storage? Are there increases in error rates based solely on fixity checking? What are the actual computing costs of checking the fixity of digital files at scale?

Bill Lefurgy described the importance of file fixity in an earlier post as “critical to ensuring that digital files are what they purport to be, principally through using checksum algorithms to verify that the exact digital structure of a file remains unchanged as it comes into and remains in preservation custody.” The NDSA is making efforts to uncover member approaches to file fixity through its regular “storage survey,” while individual members are aware of the value of regularly checking the fixity of the digital materials under their purview. The SCAPE project is looking at this, as is the computer industry. Still, it’s the digital preservation community that is taking the lead in considering these issues, and much more work needs to be done to get some basic data on what happens when we do these types of activities.
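For readers less familiar with what a fixity check involves in practice, here is a minimal sketch in Python using the standard hashlib module. It records a checksum for every file under a directory and later reports which files have changed, gone missing or appeared unexpectedly. The function names (file_checksum, build_manifest, audit), the choice of SHA-256 and the in-memory manifest are all assumptions made for illustration. Real repositories typically store manifests separately from the content and may use other algorithms.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, algorithm="sha256"):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_checksum(path, algorithm)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def audit(root, manifest, algorithm="sha256"):
    """Compare current checksums against a previously recorded manifest."""
    current = build_manifest(root, algorithm)
    report = {"changed": [], "missing": [], "unexpected": []}
    for name, recorded in manifest.items():
        if name not in current:
            report["missing"].append(name)
        elif current[name] != recorded:
            report["changed"].append(name)
    # Files present now that were never recorded in the manifest.
    report["unexpected"] = sorted(set(current) - set(manifest))
    return report
```

Note that every audit reads every byte of every file, which is exactly why the questions above about frequency, risk and computing cost matter once a collection grows large.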

Email Archiving: What are the main challenges of email archiving? How can preserved email be made accessible? Is it possible to “weed” irrelevant email messages from those that are archival through automated processes? How can email attachments be preserved along with the messages themselves? How much storage does an average email archive require?

Email archiving is a prime concern for archival institutions, especially those in government. Email archiving solutions are strongly weighted towards the type of email system employed by the organization, and as such, much of the research in the backup and storage of email has been ceded to the information technology industry. It’s uncertain whether the IT approach takes archival concerns into consideration, however, and there remains a shortage of research on email from the archival perspective that might inform IT industry practices. The Collaborative Electronic Records Project focused on the preservation of email, and there has been some research on the archival side into tools that make email archives accessible, such as Muse. Chris Prom’s definitive DPC Technology Watch Report on Preserving Email (PDF) suggests a wide range of potential research paths, but it’s unclear if more practical work has built on his excellent observations.
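As a small illustration of the attachment and storage questions, here is a minimal sketch in Python using only the standard mailbox and email modules. It assumes the email has already been exported to an mbox file, which will not be true of every email system, and it simply counts messages and attachments and totals the decoded attachment size. The function name summarize_mbox is hypothetical, and this is a starting point for gathering basic data rather than an archival workflow.

```python
import mailbox
from email import policy
from email.parser import BytesParser


def summarize_mbox(path):
    """Count messages and attachments in an mbox file and total attachment bytes."""
    summary = {"messages": 0, "attachments": 0, "attachment_bytes": 0}
    box = mailbox.mbox(path, create=False)
    try:
        for key in box.iterkeys():
            # Parse with the modern policy so iter_attachments() is available.
            message = BytesParser(policy=policy.default).parsebytes(box.get_bytes(key))
            summary["messages"] += 1
            for part in message.iter_attachments():
                # decode=True undoes base64 or quoted-printable transfer encoding;
                # it returns None for multipart attachments such as forwarded mail.
                payload = part.get_payload(decode=True) or b""
                summary["attachments"] += 1
                summary["attachment_bytes"] += len(payload)
    finally:
        box.close()
    return summary
```

Numbers like these, gathered across many real archives, would give a first answer to how much storage an email archive needs and how much of it is attachments.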

Thoughts on the above questions? Areas that you think need further research? We’d love to hear your thoughts in the comments.