While a fair amount of digital preservation focuses on objects that have clear corollaries to objects from our analog world (still and moving images and documents for example), there are a range of forms that are basically natively digital. Completely native digital forms, like database-driven web applications, introduce a variety of challenges for long-term preservation and access. I’m thrilled to discuss just such a form with Karl Nilsen and Robin Dasler from the University of Maryland, College Park. Karl is the Research Data Librarian, and Robin is the Engineering/Research Data Librarian. Karl and Robin spoke on their work to ensure long-term access to the Extragalactic Distance Database at the Digital Preservation 2014 conference.
Trevor: Could you tell us a bit about the Extragalactic Distance Database? What is it? How does it work? Who does it matter to today and who might make use of it in the long term?
Karl and Robin: The Extragalactic Distance Database contains information that can be used to determine distances between galaxies. For a limited number of nearby galaxies, the distances can be measured directly with a few measurements, but for galaxies beyond these, astronomers have to correlate and calibrate data points obtained from multiple measurements. The procedure is called a distance ladder. From a data curation perspective, the basic task is to collect and organize measurements in such a way that researchers can rapidly collate data points that are relevant to the galaxy or galaxies of interest.
The EDD was constructed by a group of astronomers at various institutions over a period of about a decade and is currently deployed on a server at the Institute for Astronomy at the University of Hawaii. It’s a continuously (though irregularly) updated, actively used database. The technology stack is Linux, Apache, MySQL and PHP. It also has an associated file system that contains FITS files and miscellaneous data and image files. The total system is approximately 500GB.
The literature mentioning extragalactic or cosmic distance runs to thousands of papers in Google Scholar, and over one hundred papers have appeared with 2014 publication dates. Explicit references to the EDD appear in twelve papers with 2014 publication dates and a little more than seventy papers published before 2014. We understand that some astronomers use the EDD for research that is not directly related to distances simply because of the variety of data compiled into the database. Future use is difficult to predict, but we view the EDD as a useful reference resource in an active field. That being said, some of the data in the EDD will likely become obsolete as new instruments and techniques facilitate more accurate distances, so a curation strategy could include a reappraisal and retirement plan.
Our agreement with the astronomers has two parts. In the first part, we’ll create a replica of the EDD at our institution that can serve as a geographically distinct backup for the system in Hawaii. We’re using rsync for transfer. Our copy will also serve as a test case for digital curation and preservation research. In this period, the copy in Hawaii will continue to be the database-of-record. In the second part, our copy may become the database-of-record, with responsibility for long-term stewardship passing more fully to the University of Maryland Libraries. In general, this project gives us an opportunity to develop and fine-tune curation processes, procedures, policies and skills with the goal of expanding the Libraries’ capacity to support complex digital curation and preservation projects.
Trevor: How did you get involved with the database? Did the astronomers come to you or did you all go to them?
Karl and Robin: One of the leaders of the EDD project is a faculty member at the University of Maryland and he contacted us. We’re librarians on the Research Data Services team and we assist faculty and graduate students with all aspects of data management, curation, publishing and preservation. As a new program in the University Libraries, we actively seek and cultivate opportunities to carry out research and development projects that will let us explore different data curation strategies and practices. In early 2013 we included a brief overview of our interests and capabilities in a newsletter for faculty, and that outreach effort lead to an inquiry from the faculty member.
We occasionally hear from other faculty members who have developed or would like to develop databases and web applications as a part of their research, so we expect to encounter similar projects in the future. For that reason, we felt that it was important to initiate a project that involves a database. The opportunities and challenges that arise in the course of this project will inform the development of our services and infrastructure, and ultimately, shape how we support faculty and students on our campus.
Trevor: When you started in on this, were there any other particularly important database preservation projects, reports or papers that you looked at to inform your approach? If so, I’d appreciate hearing what you think the takeaways are from related work in the field and how you see your approach fitting into the existing body of work.
Karl and Robin: Yes, we have been looking at work on database preservation as well as work on curating and preserving complex objects. We’re fortunate that there has been a considerable amount of research and development on database preservation and there is a body of literature available. As a starting point, readers may wish to review:
- Ribiero, Cristina and David Gabriel. “Database Preservation” (PDF). Digital Preservation Europe, 2009.
- Roberts, Bill et al. “Case Study: Database Preservation at the National Archives of the Netherlands” (PDF) Open Planets Foundation, 2010.
Some of the database preservation efforts have produced software for digital preservation. For example, readers may wish to look at SIARD (Software Independent Archiving of Relational Databases) or the Database Preservation Toolkit. In general, these tools transform the database content into a non-proprietary format such as XML. However, there are quite a few complexities and trade-offs involved. For example, database management systems provide a wide range of functionality and a high level of performance that may be lost or not easily reconstructed after such transformations. Moreover, these preservation tools may involve dependencies that seem trivial now but could introduce significant challenges in the future. We’re interested in these kinds of tools and we hope to experiment with them, but we recognize that heavily transforming a system for the sake of preservation may not be optimal. So we’re open to experimenting with other strategies for longevity, such as emulation or simply migrating the system to state-of-the-art databases and applications.
Trevor: Having a fixed thing to preserve makes things a lot easier to manage, but the database you are working with is being continuously updated. How are you approaching that challenge? Are you taking snapshots of it? Managing some kind of version control system? Or something else entirely? I would also be interested in hearing a bit about what options you considered in this area and how you made your decision on your approach.
Karl and Robin: We haven’t made a decision about versioning or version control, but it’s obviously an important policy matter. At this stage, the file system is not a major concern because we expect incremental additions that don’t modify existing files. The MySQL database is another story. If we preserve copies of the database as binary objects, we face the challenge of proliferating versions. That being said, it may not be necessary to preserve a complete history of versions. Readers may be interested to know that we investigated Git for transfer and version control, but discovered that it’s not recommended for large binary files.
Trevor: How has your idea of database preservation changed and evolved by working through this project? Are there any assumptions you had upfront that have been challenged?
Karl and Robin: Working with the EDD has forced us to think more about the relationship between preservation and use. The intellectual value of a data collection such as the EDD is as much in the application–joins, conditions, grouping–as in the discrete tables. Our curation and preservation strategy will have to take this fact into account. We expect that data curators, librarians and archivists will increasingly face the difficult task of preservation planning, policy development and workflow design in cases where sustaining the value of data and the viability of knowledge production depends on sustaining access to data, code and other materials as a system. We’re interested to hear from other librarians, archivists and information scientists who are thinking about this problem.
Trevor: Based on this experience, is there a checklist or key questions for librarians or archivists to think through in devising approaches to ensuring long term access to databases?
Karl and Robin: At the outset, the questions that have to be addressed in database preservation are identical to the questions that have to be addressed in any digital preservation project. These have to do with data value, future uses, project goals, sustainability, ownership and intellectual property, ethical issues, documentation and metadata, data quality, technology issues and so on. A couple of helpful resources to consult are:
- Maron, Nancy L., Jason Yun, and Sarah Pickle. “Sustaining Our Digital Future: Institutional Strategies for Digital Content.” London, UK: JISC, 2013.
- Whyte, Angus, and Andrew Wilson. “How to Appraise & Select Research Data for Curation.” London, UK, and Melbourne, AU: Digital Curation Centre and Australian National Data Service, 2010.
Databases may complicate these questions or introduce unexpected issues. For example, if the database was constructed from multiple data sources by multiple researchers, which is not unusual, the relevant documentation and metadata may be difficult to compile and the intellectual property issues may be somewhat complicated.
Trevor: Why are the libraries at UMD the place to do this kind of curation and preservation? In many cases scientists have their own data managers, and I imagine there are contributions to this project from researchers at other universities. So what is it that makes UMD the place to do it and how does doing this kind of activity fit into the mission of the university and the libraries in particular?
Karl and Robin: While there are well-funded research projects that employ data managers or dedicated IT specialists, there are far more scientists and scholars who have little or no data management support. The cost of employing a data manager, even part-time, is too great for most researchers and often too great for most collaborations. In addition, while the IT departments at universities provide data storage services and web servers, they are not usually in the business of providing curatorial expertise, publishing infrastructure and long-term preservation and access. Further, while individual researchers recognize the importance of data management to their productivity and impact, surveys show that they have relatively little time available for data curation and preservation. There is also a deficit of expertise in general, though some researchers possess sophisticated data management skills.
Like many academic libraries, the UMD Libraries recognize the importance of data management and curation to the progress of knowledge production, the growth of open science and the success of our faculty and students. We also believe that library and archival science provide foundational principles and sets of practices that can be applied to support these activities. The Research Data Services program is a strategic priority for the University of Maryland Libraries and is highly aligned with the Libraries’ mission to accelerate and support research, scholarship and creativity. We have a cross-functional, interdisciplinary team in the Libraries–made up of subject specialists and digital curation specialists as needed–and partners across the campus, so we can bring a range of perspectives and skills to bear on a particular data curation project. This diversity is, in our view, essential to solving complex data curation and preservation problems.
We have to acknowledge that our work on the EDD involves a number of people in the Libraries. In particular, Jennie Levine Knies, Trevor Muñoz and Ben Wallberg, as well as University of Maryland iSchool students Marlin Olivier and, formerly, Sarah Hovde, have made important contributions to this project.