Where are the Born-Digital Archives Test Data Sets?

By Butch Lazorchak and Trevor Owens

We’ve talked in the past on the Signal about the need for more applied research in digital preservation and stewardship. This is a key issue addressed by the 2014 National Agenda for Digital Stewardship, which dives in a little deeper to suggest that there’s a great need to strengthen the evidence base for digital preservation.

But what does this mean exactly?

Scientific fields have a long tradition of applied research and have amassed common bodies of evidence that can be used to systematically advance the evaluation of differing approaches, tools and services.

This approach is common in some areas of library and archives study, such as the Text Retrieval Conferences and their common data pools, but is less common in the digital preservation community.

As the Agenda mentions, there’s a need for some open test sets of digital archival material that folks can use to benchmark and evaluate tools, but the first step should be to establish the criteria for such data collections. What would make a good digital preservation test data set?

1. Needs to be real-world messy stuff: The whole point of establishing digital preservation test data sets is to have actual data to be able to run jobs against. An ideal set would be sanitized, processed or normalized to the least extent possible. Ideally, these data sets would come with some degree of clearly stated provenance and a set of checksums to allow researchers to validate that they are working on real stuff.

2. Needs to be public: The data needs to be publicly accessible in order to encourage the widest use, and should be available via a URL without having to ask permission. This will allow anyone (even inspired amateurs) to take cracks at the data.

3. Needs to be legal to work with: There are many exciting honey pots of data out there that satisfy the first two requirements but live in legal grey areas. Many of the people working with these data sets will operate in government agencies and academia where clear legality is key.
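The checksum validation mentioned in criterion 1 is straightforward to automate. Here is a minimal sketch in Python: it assumes the data set publisher ships a `sha256sum`-style manifest (one `<digest>  <filename>` pair per line), which is an illustrative convention rather than an established standard for these collections.

```python
# Sketch of criterion 1's checksum validation: compare local files
# against a published manifest. The manifest format (sha256sum-style
# lines) and file names are assumptions for illustration.
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_path):
    """Yield (filename, ok) for each entry in a sha256sum-style manifest."""
    base = Path(manifest_path).parent
    with open(manifest_path) as f:
        for line in f:
            if not line.strip():
                continue
            expected, name = line.split(maxsplit=1)
            name = name.strip()
            yield name, sha256_of(base / name) == expected
```

A researcher could run this against a freshly downloaded copy of a test set to confirm they are working on the same bits everyone else is.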

There are some data sets currently available that meet most of the above criteria, though most are not designed specifically as digital preservation testbeds. Still, these provide a beginning to building a more comprehensive list of available research data, on the way to tailor-made digital preservation testbeds.

Some Initial Data Set Suggestions:

The social life of email at Enron – a new study from user chieftech on Flickr.

Enron Email Dataset: This dataset consists of a large set of email messages that was made public during the legal investigation concerning the Enron Corporation. It was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes) under the auspices of the Defense Advanced Research Projects Agency. The collection contains a total of about half a million messages and was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
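The Enron corpus is distributed as a directory tree of plain-text RFC 822 messages, which makes it easy to start running simple benchmark-style jobs against it with the standard library. The sketch below is a hedged illustration, not an official tool; the directory layout it walks is just whatever results from unpacking a local copy.

```python
# Sketch of a trivial benchmark-style job over an email corpus like
# Enron's: walk a directory of plain-text RFC 822 messages and tally
# messages per From: header. The local directory layout is assumed.
from email import message_from_string
from pathlib import Path

def iter_messages(root):
    """Walk a directory tree, parsing each file as an email message."""
    for path in Path(root).rglob("*"):
        if path.is_file():
            try:
                text = path.read_text(errors="replace")
            except OSError:
                continue  # skip unreadable files rather than aborting
            yield path, message_from_string(text)

def count_senders(root):
    """Return a dict mapping each From: header to its message count."""
    counts = {}
    for _, msg in iter_messages(root):
        sender = msg.get("From", "(missing)")
        counts[sender] = counts.get(sender, 0) + 1
    return counts
```

Timing a job like this across the full half-million messages is exactly the kind of comparable, repeatable measurement a shared test set makes possible.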

NASA NEX: The NASA Earth Exchange is a platform for scientific collaboration and research for the earth science community. NEX users can explore and analyze large Earth science data sets, run and share modeling algorithms, collaborate on new or existing projects and exchange workflows. NEX has a number of datasets available, but three large sets have been made readily available to public users. One, the NEX downscaled climate simulations, provides high-resolution climate change projections for the 48 contiguous U.S. states. The second, the MODIS (or Moderate Resolution Imaging Spectroradiometer) data, offers a global view of Earth’s surface, while the third, the Landsat data record from the U.S. Geological Survey, provides the longest existing continuous space-based record of Earth’s land.

GeoCities Special Collection 2009: GeoCities was an important outlet for personal expression on the Web for almost 15 years, but was discontinued on October 26, 2009. This partial collection of GeoCities personal websites was rescued by the Archive Team and is about a terabyte of data. For more on the GeoCities collection see our interview with Dragan Espenschied from March 24.

There are other collections, such as the September 11 Digital Archive hosted by the Center for History and New Media at George Mason University, that have been used as testbeds in the past, most notably in the NDIIPP-supported Archive Ingest and Handling Test, but the data is not readily available for bulk download.

There are also entities that host public data sets that anyone can access for free, but further investigation is needed to see whether they meet all the criteria above.

We need testbeds like this to explore digital preservation solutions. Let us know about other available testbeds in the comments.