The Jisc Research Data Spring workshop at the Warwick Conference Centre in Coventry had some welcome moments of blue sky before the mid-December dull grey set in. These included a breakout session from one of the projects: Collaboration for Research Enhancement by Active Metadata (CREAM). The session explored the active use of metadata in the arts and sciences, a theme the project members have been exploring for some time.
The workshop, titled ‘Observations on Commonalities of Process’, was led by Iris Garrelfs and Graham Klyne, whose two-handed presentation drew out key parallels between the arts and sciences. Iris Garrelfs spoke as a PhD student and artist who works “on the cusp of music, art and sociology”. Graham Klyne, of Nine by Nine, spoke as an ex-University of Oxford bioinformatician and contributor to many semantic web standards.
This seemed unaccustomedly philosophical territory for a Jisc programme workshop, in my experience anyway. And, despite any seasonal temptation, nobody made any rubbish puns about C.P. Snow, or made too much of his big theme: the rift between the ‘two cultures’ of the sciences and humanities, which the chemist-turned-novelist famously wrote about and which is still with us today. Much of the research-driven impetus behind Research Data Management has come from the STEM disciplines. Perhaps understandably, given the impact of the EPSRC’s data policy on UK institutions, this has antagonised many humanities researchers who would rather deal with policy directions couched in their own terms. So I guess if Snow were still around he would have approved of this session.
Rather than becoming bogged down in differences of terminology and epistemology, the session brought fresh thinking on common methods and tools for dealing with arts and humanities metadata. The main discussion themes were:
- planning and agility
- workflow and lifecycle
Each theme was introduced by Iris and Graham, based on the project’s effort to develop a model for ‘active metadata’. They included reflections on the research processes followed by artists at University of the Arts London, and by chemists and geoscientists in Southampton and Edinburgh. So there were many contributions from collaborators Athanasios Velios, Simon Coles, and others outside the project.
1. Planning and agility
Some of the fresh thinking mentioned earlier comes in the shape of Iris Garrelfs’s Procedural Blending model. This is an abstract framework for describing creative processes, set out in her PhD thesis, and based on her work in sound art.
If I picked up Iris’s quick introduction to the model correctly, the gist is that creative processes do not follow a stepwise linear path from input to output, but blend parallel strands of action (or ways of framing a problem) that become joined together at key points in the research process. The question is: how can this be recorded in useful ways?
Provenance metadata is part of the answer for CREAM. Reflecting on his involvement in the W3C PROV collaboration, Graham Klyne’s take on this standard for provenance metadata was that it offers a very useful structure for encoding process, but says little about its less mechanical aspects. The Procedural Blending model, he said, has offered a fascinating counterbalance to PROV, and may give the provenance standards a broader framework for these less tangible aspects of data management. Of course, provenance is a retrospective record of action, while research planning and workflow design are prospective. Addressing the tacit and intangible seems key to working out how the provenance metadata emerging from a project can be applied as a resource for planning-in-action.
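As a rough illustration of the PROV core idea, retrospective provenance links entities (things) to the activities that used and generated them. This is a minimal sketch only, not the W3C serialisation, and the identifiers and steps are hypothetical, not drawn from the CREAM project:

```python
from dataclasses import dataclass, field

# Sketch of the PROV core: entities, activities, and the
# 'used' / 'wasGeneratedBy' relations between them.

@dataclass
class Entity:
    id: str

@dataclass
class Activity:
    id: str
    used: list = field(default_factory=list)       # entities consumed
    generated: list = field(default_factory=list)  # entities produced

def lineage(entity, activities):
    """Walk generation links back to the entities an output derives from."""
    sources = []
    for act in activities:
        if entity in act.generated:
            for src in act.used:
                sources.append(src)
                sources.extend(lineage(src, activities))
    return sources

sample = Entity("ex:sample-01")
protocol = Entity("ex:protocol-v2")
results = Entity("ex:results-01")

run = Activity("ex:synthesis-run", used=[sample, protocol], generated=[results])

print([e.id for e in lineage(results, [run])])
# → ['ex:sample-01', 'ex:protocol-v2']
```

The point of the structure is exactly what Graham described: it encodes the mechanics of process very well, while the tacit choices behind each step sit outside the model.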
At first arts-science differences were most evident when the project began working out how to take that forward, but then parallels became clear. These include several aspects of the trade-off between planning and agility in research.
- Amendments and changes in process. The project has considered research around chemical reactions, responses to planning that research, and the role of improvisation. At one extreme improvisation can be thought of as ‘developing a plan in the moment’. At the other it can refer to points where a researcher is responding to observations and adapting (say) a spreadsheet to record experiment outcomes.
- Re-framing. Iris pointed out that artists are used to taking conceptual and physical objects and turning them on their head to look at them from a different perspective, whether literally or metaphorically. Science aims to nail down processes in a more definitive and reproducible way. But as Graham and others commented, the way that science research is reported suggests design that is more planned than it is in practice. So CREAM has become focused on the messy aspects of research design; not just when the milk gets spilled, as it were, but acting on the smell of it: those points when arbitrary choices are made, or the data researchers are faced with suggests a new line of investigation. Here it is detailed background knowledge that makes for the ability to decide what line to take.
These points resonated with challenges to reproducibility that the RDM community is trying to address. Simon Coles mentioned, for example, that only a minority, perhaps 20%, of chemical syntheses are reproducible, because tacit knowledge and arbitrary decisions do not get recorded.
Neil Jefferies made a further connection with the under-reporting of negative results, and the idea that capturing this vast body of knowledge of ‘what didn’t work’ could save time by identifying what won’t work in future. Simon Coles pointed out that accounting for negative results that don’t go according to plan is a very different thing from accounting for agility in planning. And Southampton ex-colleague Mathew Addis pointed out that the desire for this level of accounting varies by discipline, but isn’t restricted to academia. Chemists try to record everything, and so does the pharma industry.
The Southampton University Chemistry group’s work with electronic lab notebooks (ELNs) has taught them a thing or two about what people actually record and do with paper and digital notebooks. The greater sense of ownership researchers have over paper notebooks affects their willingness to make the switch. ELN take-up is difficult where there are specific research values around data ownership, and the group is investigating ways to encourage submission of ELNs.
Humanities scholars tend to deal with the subjectivity of research decision-making differently from scientists. Iris spoke about artists’ and scholars’ concern to understand motivations and influences. She pointed out that history and archaeology are concerned with problems similar to scientific reproducibility – doing forensic studies of ‘how people got there’. Generalising across the humanities, the view tends to be that provenance is debatable, while for scientists it is a record of the path of their research that does not need or deserve debate.
These differences can be productive, though; some present commented on the value of capturing motivations in science. There is also value in drawing out the role of decisions that can’t be planned for, e.g. apparently arbitrary decisions about what is looked at, or selected for analysis. Conventional recording processes rarely allow for this in the sciences, and appraisal processes emphasise deliberate and rational choice.
Having light shed on arbitrary decision-making from models of the creative process may help to incorporate into the scientific record metadata on how ideas are made and how spontaneity is dealt with. As Simon Coles pointed out, research depends on the ability to deviate from a plan on the fly. And as Neil Jefferies remarked, collaborations often depend on people knowing they share a similar feeling about the problem at hand, one that comes from their aggregated history. Exposing the aspects that influence how decisions are made could help with reproducibility, but they are challenging to record.
2. Workflow and lifecycle
If you have tried to apply research data lifecycle models in practice and thought ‘ok that’s fine for the fly-by overview, but life’s not like that’, you will probably appreciate the problem CREAM is trying to tackle. One of the marked similarities the project found was between the procedural blending model and models of the research lifecycle more common in the sciences.
There are well-known barriers to the practicality of documenting more fine-grained and realistic metadata, the prime one being justifying the expense of doing it. But there are nuances to the cost-benefit trade-offs. Obviously automating the metadata gathering helps, but only if the metadata is more meaningful and useful than metadata handcrafted from fallible memory and hindsight-based rationalisations about what happened. This is where the CREAM collaborators believe workflow models that allow provenance metadata to be applied prospectively may help. So far, they said, they had been pleasantly surprised at how much the fluidity of the artistic view could shed light on scientific process and, from the artistic perspective, at the potential of scientific workflow techniques for recording process.
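One way to picture prospective provenance, as a sketch under my own assumptions rather than the CREAM model itself: record the planned workflow steps up front, log what actually happens, and diff the two so that improvised deviations are captured at the time rather than rationalised from memory afterwards. All step names here are hypothetical:

```python
# Hypothetical sketch: a prospective plan of workflow steps, a
# retrospective log of what actually happened, and a diff that
# surfaces the improvised deviations worth recording as metadata.

def deviations(planned, performed):
    """Return steps that were skipped or added relative to the plan."""
    skipped = [s for s in planned if s not in performed]
    added = [s for s in performed if s not in planned]
    return {"skipped": skipped, "added": added}

plan = ["prepare sample", "run reaction", "measure yield"]
log = ["prepare sample", "adjust temperature", "run reaction", "measure yield"]

print(deviations(plan, log))
# → {'skipped': [], 'added': ['adjust temperature']}
```

The deviation record, rather than the plan or the log alone, is what holds the ‘agility’ information that conventional reporting loses.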
Ownership and attribution issues were the key ones highlighted at this point in the discussion. Copyright and plagiarism concerns drive a reluctance to record research processes. For some present that pointed to the need to enable a hierarchy of access to data. Workflows for research data sharing must allow for much of the data to be kept to known collaborators for much of the time. The RDM community’s general invocation to be as open as possible, as quickly as feasible, can drown out that message.
Fiona Murphy, who coupled her humanities background with her experience in scientific publishing, highlighted an important question: how important is it who actually makes the observations that create data? From a reproducibility perspective these observations should, in principle at least, be independent of the observer making them, but how is that actually viewed in practice? Some of the scientists present were happy to acknowledge that some people are better than others at making observations. Had there been more humanists or sociologists of science in the room, this might have sparked further debate about epistemology, or about how researchers’ biographies and social networks actually affect what research gets done.
Other participants reiterated the earlier point that science reporting also tends to play down the creative elements of the research process. And from her arts background Iris Garrelfs mentioned that the convention of working within a genre, and following its rules of provenance (among other things), has similarities with the call for reproducibility in science.
Two main points wrapped up the session. The first was that scientific and humanistic datasets can be used by researchers on the other side of the divide for purposes neither side imagined, so it makes sense to have common data management frameworks. The other was to encourage researchers, and others involved in the RDM field, to go beyond a mandate-fulfilling view of reproducibility. Records of process aren’t just useful for re-treading your own path; they can be a resource for doing things outside your own field.
From my own point of view I liked this workshop a lot, and was pleased to see there’s a similar one planned for IDCC. Many of the themes will be familiar to provenance researchers and also touch on the sociology of science. I was also reminded of Arthur Koestler’s ‘bisociation’ theory of creative thinking. I used that in my very first published journal article [1989, lost to digital rot, but papyrus still available!], so it had plenty of personal resonances.
CREAM are pursuing a novel approach, and more recent parallels struck me. On the sociological side of things there is work by the Information Systems group at the LSE on ‘Collective agility, paradox and organizational improvisation’, based on a study of particle physics research processes in the GridPP collaboration. More current still, the Research Data Alliance has several groups addressing the ‘planning and agility’ theme: the interest group on Active Data Management Plans, plus another on ‘De-constructing the Data Lifecycle: Agile Curation’.
CREAM is part of a flurry of tech development aiming at better record-making tools for research. The hope is that they’ll offer metadata that is actually useful for research before it’s done, as well as more accurate about how it’s done, all with less effort and higher usability than the traditional lab record or artist’s notepad. The results remain to be seen.
Photo credit: ‘just spilled milk‘ by Post Memes CC-BY-2.0
 More on the Jisc Research Data Spring projects is available at: https://www.jisc.ac.uk/rd/projects/research-data-spring
 For example: Cerys Willoughby, Colin Bird, Jeremy Frey (2015) ‘User-Defined Metadata: Using Cues and Changing Perspectives’, International Journal of Digital Curation, 10(1), pp. 18-47. doi:10.2218/ijdc.v10i1.343
 Metadata in Action workshop, IDCC16, Amsterdam, 25 Feb 2016 http://www.dcc.ac.uk/events/idcc16/workshops#Workshop%2010
 Arthur Koestler (1964) The Act of Creation, London: Penguin Books
 Zheng, Yingqin, Venters, Will and Cornford, Tony (2011) ‘Collective agility, paradox and organizational improvisation: the development of a particle physics grid’, Information Systems Journal. DOI: 10.1111/j.1365-2575.2010.00360.x Pre-print available at: http://eprints.lse.ac.uk/30029/1/Collective_agility_%28LSERO%29.pdf
 Research Data Alliance, ‘Active Data Management Plans’ IG, see: https://rd-alliance.org/groups/active-data-management-plans.html
 Research Data Alliance, ‘Deconstructing the Data Life Cycle- Agile Curation’ Birds of a Feather Group, see: https://rd-alliance.org/groups/deconstructing-data-life-cycle-agile-curation.html