Continuing with our series of blog posts devoted to the upcoming Digital Preservation 2014 conference, the following interview features a preview of the panel session entitled “Research Data and Curation” with panel members Inna Kouper (Data to Insight Center at Indiana University), Elizabeth Yakel (University of Michigan School of Information) and Ixchel Faniel (OCLC Research).
Susan: Could you each provide a short overview of what will be covered for this panel session?
Inna: The panel is titled “Research Data and Curation” and it will address some of the challenges of preserving and curating research data, focusing particularly on data re-use. We plan to discuss the stages that research data go through as they are collected, analyzed and used, and how we can contribute to making sure that the data can be used or understood by others in the future. The problem is large and quite complex, so each of our talks will cover a certain aspect of it.
Dharma Akmon and I will talk about the complexity and heterogeneity of data products and their implications for preservation. We approach the problem from a systems perspective and conceptualize heterogeneous bundles of data as research objects that transition from a live stage into a curation stage and then into a publication stage. Each stage is characterized by varying degrees of mutability. This model allows us to formalize two cases of data reuse – revisions and derivations – and use those formalizations to track provenance of data. We will use examples from the SEAD project to demonstrate how the model works.
Elizabeth Yakel and Ixchel Faniel approach the complexity of data curation from the perspective of multiple actors that are involved in data re-use and expand the notion of preservation from preserving the bits into capturing the meanings. We hope that between our two talks we will generate a rich discussion of how to capture context and content and what can or cannot be formalized in data curation.
Beth and Ixchel: Our presentation, “Three Perspectives on Data Reuse: Producers, Curators, and Reusers,” presents a real-life instance of data sharing, data curation and data reuse. Unlike previous studies that concentrate on one perspective (that of data producers, repository staff or data reusers), our case study follows the data from sharing to reuse and captures the different perspectives of participants along the way.
Susan: Concerning the issues surrounding data reuse for archaeologists – why is this becoming so important in this field?
Beth and Ixchel: In the past, archaeologists focused on a single site, building a deep understanding of the culture, economics, and social structures within one locality. However, the nature of research questions has changed; archaeologists now want to examine larger social, economic and cultural transitions between ancient civilizations. No one archaeologist could possibly survey or excavate the number of sites needed for these broader research questions. This creates an imperative for data sharing and new opportunities for collaboration. As a result, data reuse has become increasingly important, although still not the norm. Disciplinary culture, logistics and legal questions surrounding data ownership are all factors impeding a more open data ethos.
Susan: You cite the need to capture transformations across the entire lifecycle of digital data. Could you define transformation in this context?
Inna: We see transformations as changes in states that data entities go through. Following a model proposed by D. DeRoure, C. Goble and others, we consider research data as bundles of resources that can be “live” and modifiable at some point and then “fixed” and immutable at other points in time. Such bundles, or research objects, go through the processes of collection, compilation, cleaning, re-arrangement, computation, aggregation and so on. They receive additional descriptions and re-arrangements during the publication stage. A bundle can later be downloaded and used by a researcher from the same or a different field, and then a new bundle of resources will be generated that is related to, but different from, the original. All these changes in state and content of resources need to be identified, captured and tracked.
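To make the research object model described above a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the SEAD project's implementation: the class and method names, the strict live-to-curation-to-publication ordering, the rule that only published objects are immutable, and the reading of a "revision" as a new version of the same object and a "derivation" as a new object built from it are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Stage(Enum):
    # Assumed ordering of stages; mutability decreases from top to bottom.
    LIVE = "live"
    CURATION = "curation"
    PUBLICATION = "publication"


STAGE_ORDER = [Stage.LIVE, Stage.CURATION, Stage.PUBLICATION]


@dataclass
class ResearchObject:
    """A bundle of resources with a lifecycle stage and a provenance link."""
    identifier: str
    resources: List[str] = field(default_factory=list)
    stage: Stage = Stage.LIVE
    # (relation, source identifier), e.g. ("derived_from", "ro-1")
    provenance: Optional[Tuple[str, str]] = None

    def add_resource(self, name: str) -> None:
        # Assumption: published objects are fixed; earlier stages stay editable.
        if self.stage is Stage.PUBLICATION:
            raise ValueError("published research objects are immutable")
        self.resources.append(name)

    def advance(self) -> None:
        """Move the object to the next stage in the assumed ordering."""
        position = STAGE_ORDER.index(self.stage)
        if position == len(STAGE_ORDER) - 1:
            raise ValueError("object is already published")
        self.stage = STAGE_ORDER[position + 1]

    def _spawn(self, relation: str, new_id: str) -> "ResearchObject":
        # The new object starts live, with a copy of the resources and a
        # recorded link back to the object it came from.
        return ResearchObject(
            identifier=new_id,
            resources=list(self.resources),
            stage=Stage.LIVE,
            provenance=(relation, self.identifier),
        )

    def revise(self, new_id: str) -> "ResearchObject":
        return self._spawn("revision_of", new_id)

    def derive(self, new_id: str) -> "ResearchObject":
        return self._spawn("derived_from", new_id)
```

In this sketch a researcher who reuses a published bundle would call `derive`, work on the resulting live copy, and leave the original untouched, while the `provenance` field records where the new bundle came from.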
Susan: Why is it important to capture and preserve the data throughout its lifecycle? Is this particularly important in scientific research, more than other fields?
Inna: Curating research data throughout its lifecycle means that we capture information about who created the data and how, what was excluded or included, how the instruments were developed and calibrated and so on. It is particularly important in science, because it helps to establish trust and authority, ensure data quality and interpretability and realize its cumulative potential. But it can be equally important in such fields as journalism for the same reasons. Capturing as much as possible about processes and contexts of data collection from the beginning rather than at the end can also help us to avoid duplicating efforts and repeating the same mistakes. It is a tough challenge though.
Beth and Ixchel: There is a symbiotic relationship between different phases of the data lifecycle. Decisions made during collection and initial documentation can affect how easy or hard it is to share data; the condition of the data and the documentation affect the time it takes repository staff to process data; and that in turn influences data reusers’ ability to reuse data or even their decision to expend the effort to try to reuse data. The list of contextual elements which are important to capture is long, but some of the most important elements are data descriptive information, research design/methods and relationships among data.
Capturing and preserving information (dare we say metadata) or the context of different stages of the data lifecycle is important in documenting any type of data intended for reuse. That includes administrative data generated by government agencies, qualitative interview or observational data or scientific data. This is simply part of the process by which the meaning of data is transmitted over time. Preservation of the meaning is as important as preservation of the bits. Preserved bits are useless if the context for interpretation is not preserved.
Susan: What will the audience discussion be focused on?
Beth and Ixchel: Specifically, we would like to talk about what we can/should expect from data producers/sharers, repository staff and data reusers. We would also like to discuss what type of education each group needs to curate data at its point in the lifecycle. And finally, how can we align incentives around a common goal of sharing, preserving, and reusing high-quality data and documentation?
Inna: I’d like the audience to help us think about the gaps in our approaches to preservation of research data. Is it effective to apply the existing preservation frameworks to digital data? What are we missing, especially when we’re trying to develop tools to support and automate data preservation? How are curation and preservation connected to data publication? Is it useful to distinguish between published and preserved/archived data objects or should we change our concepts and metaphors in the age of digital fluidity? What does reuse mean for the research data lifecycle? These and many other practical and conceptual considerations can become the focus of our discussion.
I’m really looking forward to the discussion and would welcome any contributions that would add more details and nuances to the picture of data curation and help this area move forward.