Category: #idcc16

#IDCC16: Atomising data: Rethinking data use in the age of the explicitome

Data re-use is an elixir for those involved in research data.

Make the data available, add rich metadata, and then users will download the spreadsheets, databases, and images. The archive will be visited, making librarians happy. Datasets will be cited, making researchers happy. Datasets may even be re-used by the private sector, making university deans even happier.

But it seems to me that data re-use, or at least the particular conceptualisation of re-use established in most data repositories, is not the definitive way of conceiving of data in the 21st century.

Two great examples from the International Digital Curation Conference illustrated this.

Barend Mons declared that the real scientific value in scholarly communication lies not in abstracts, articles or supplementary information. Rather, the data that sits behind these outputs is the real oil to be exploited, featuring millions of assertions about all kinds of biological entities.

Mons describes the sum of these assertions as the explicitome, and it enables cross-fertilisation between distinct strands of scientific work. With all experimental data made available in the explicitome, researchers taking an aerial view can suddenly see all kinds of new connections and patterns between entities cited in wholly different research projects.

The second example was Eric Kansa’s talk on the Open Context framework for publishing archaeological data. Following the same principle as Barend Mons, Open Context breaks data down into individual items. Instead of downloading a whole spreadsheet relating to a single excavation, you can access individual bits of data. From an excavation, you can see the data related to a particular trench, and then the items discovered in that trench.

A screenshot from Open Context

In both cases, data re-use is promoted, but in an entirely different way to datasets being uploaded to an archive and then downloaded by a re-user.

In the model proposed by Mons and Kansa, data is atomised and then published. Each individual item, or each individual assertion, gets its own identity. And that piece of data can then easily be linked to other relevant pieces of data.
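As a minimal sketch of what this atomisation might look like (all identifiers and field names here are hypothetical illustrations, not the actual Open Context or nanopublication schema), each assertion becomes an independently addressable record rather than a cell buried in a spreadsheet:

```python
# Hypothetical sketch: "atomising" a dataset means every single
# assertion gets its own identifier, so it can be cited and linked
# independently of the file it came from.

from dataclasses import dataclass

@dataclass(frozen=True)
class Assertion:
    id: str         # persistent identifier for this one piece of data
    subject: str    # the entity the assertion is about
    predicate: str  # the relationship being asserted
    obj: str        # the value or linked entity

# One spreadsheet row ("item 17, an amphora, found in trench 4")
# becomes several independently addressable assertions.
assertions = [
    Assertion("ex:a1", "ex:trench-4", "partOf", "ex:excavation-x"),
    Assertion("ex:a2", "ex:item-17", "foundIn", "ex:trench-4"),
    Assertion("ex:a3", "ex:item-17", "hasType", "amphora"),
]

def about(entity, store):
    """Everything the store asserts about a single entity."""
    return [a for a in store if a.subject == entity]

for a in about("ex:item-17", assertions):
    print(a.id, a.predicate, a.obj)
```

Because each assertion carries its own identifier, a third party can cite or link to `ex:a3` alone without ever touching the rest of the dataset.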

This hugely increases the chance of data re-use; not of whole datasets, of course, but of tiny fractions of datasets. An archaeologist examining the remains of jars on French archaeological sites might not even think to look at a dataset from a Turkish excavation. But if the latter dataset is atomised in a way that identifies the presence of jars as well, then suddenly that element of the Turkish dataset becomes useful.
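To make the jar scenario concrete, here is a hedged sketch (the datasets and vocabulary are invented for illustration) of how atomised records from two unrelated excavations can answer a single cross-dataset question:

```python
# Hypothetical sketch: once two unrelated datasets are atomised into
# (entity, property, value) assertions using a shared vocabulary,
# one query can surface matches across both -- e.g. every jar.

french_site = [
    ("fr:item-3", "hasType", "jar"),
    ("fr:item-3", "foundAt", "fr:site-lyon"),
]
turkish_site = [
    ("tr:item-88", "hasType", "jar"),
    ("tr:item-88", "foundAt", "tr:site-ephesus"),
    ("tr:item-90", "hasType", "coin"),
]

def items_of_type(kind, *datasets):
    """Entities of a given type, drawn from any atomised dataset."""
    return [s for data in datasets
              for (s, p, o) in data
              if p == "hasType" and o == kind]

print(items_of_type("jar", french_site, turkish_site))
# Both datasets now contribute: ['fr:item-3', 'tr:item-88']
```

The French researcher never downloads the Turkish spreadsheet; only the one relevant fragment of it surfaces. The catch, of course, is the shared vocabulary: both datasets must agree that `hasType` and `jar` mean the same thing.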

This approach to data is the big challenge for those charged with archiving it. Many data repositories, particularly institutional ones, store individual files but not individual pieces of data. How research data managers begin to cope with the explicitome – enabling, nourishing and sustaining it – may well be a topic of interest for IDCC17.

#IDCC16: Strategies and tactics in changing behaviour around research data

The International Digital Curation Conference (IDCC) continues to be about change. That is, how do we change the ecosystem so that managing data is an essential component of the research lifecycle? How can we free the rich data trapped in PDFs or lost to linkrot? How can we get researchers to data mine and not data whine?

While, for some, the pace of change is not quick enough, IDCC still demonstrates an impressive breadth of strategy and tactics to enable this change.

On the first day of the conference, Barend Mons set out the vision. The value of research is not in journals but in the underlying data – thousands and thousands of assertions about genes, bacteria, viruses, proteins, indeed any biological entity are locked in figures and tables. Release such data and the interconnections between related entities in different datasets reveals whole new patterns. How to make this happen? One part of the solution: all projects should allocate 5% of their budget to data stewardship.

Andrew Sallans of the Center for Open Science followed this up with their eponymous platform, the Open Science Framework, which links data to all kinds of cloud providers and (fingers crossed) institutions’ data repositories. In large-scale projects, sharing and versioning data can easily get out of control; the framework helps to manage this process more easily. They have some pretty nifty financial incentives to change practice too – $1000 awards for pre-registration of research plans.

Following this we saw many posters – tactics to alter behaviours of individuals and groups of researchers. There were some great ideas here, such as plans at the University of Toronto to develop packages of information for librarians on data requirements of different disciplines. 

Despite this, my principal concern was the huge gap between the massive sweep of the strategic visions and the tactics for implementing change. Many of the posters were valiant but locked in an institutional setting – libraries wrestling with how to influence faculty without the in-depth knowledge (or institutional clout) to make winning arguments within a particular area.

What still seems to be missing from IDCC is the disciplinary voice. How are particular subjects approaching research data? How can the existing community work more closely with them? There was one excellent presentation on building workflows for physicists studying gravitational waves, and others on OCLC’s work with social scientists and zoologists. But in most cases it was us librarians doing the talking rather than a shared platform with the researchers. If we want that change to happen, there still needs to be greater engagement with the subjects that are creating the research data in the first place.

Open Data Panel at #IDCC16

The 11th International Digital Curation Conference is just around the corner, and we are anticipating great discussions in Amsterdam in a couple of weeks.

In the first of our series of preview posts, members of the Open Data Panel at IDCC – Fiona Nielsen, Marta Hoffman-Sommer, Phil Archer, Thomas Ingraham and Jeroen Rombouts – briefly explain why open data is important and what the benefits and challenges of sharing research data are.

Your session will focus on open data. Are there any specific messages you would like people to take away from it?

Fiona Nielsen (founder and CEO of DNAdigest and Repositive): My take-home message would be that publicly funded research data should be made available to the research community. The earlier and the more systematically you do so, the more benefit and credit you gain as a researcher among your peers and funders. Whenever research data does not hold personally identifiable information (PII), it should be made available as soon as possible. For many types of data this means that it can be published as open data along with data descriptors, metadata and helpful advice from the authors. However, in the special case of PII, access to the data will need to be managed with a governance model that matches the consent given for its use. All data descriptors, metadata and helpful advice from the authors can and should be made available as open data to maximise discoverability, accessibility and opportunities for reuse and reproducibility of results.

Marta Hoffman-Sommer (Open Science Platform at ICM University of Warsaw): I suppose we all agree that data from publicly funded research should be shared openly. What I would like to stress is how much individual researchers stand to gain from opening data – that sharing data really can be a standard part of a successful way of doing science.

Phil Archer (W3C): I wonder whether we might try and see things from the other direction a little. I am as guilty as anyone of starting from the concept of open data and saying it’s a really cool thing to do. The alternative angle is to think about what research questions you want to answer, and what prompted you to research that specific question – quite possibly it was someone else’s work that made you think of something. OK, so if I had access to that original data, here’s how I’d be able to use it, augment it, test a new hypothesis and so on, perhaps mixed with other data that already exists. It’s important to credit other people’s work, of course – that’s always the carrot.

Thomas Ingraham (Publishing Editor at Open Life Science publishing platform F1000Research): One thing I will say is that there are many arguments in favour of open research data; most focus on the altruistic benefits to other scientists and wider society, others on the direct benefits to the submitting researcher. The best bet is to go with the former when trying to convince an organisation, and the latter when trying to convince individual scientists. It helps to tailor the argument to the recipient, rather than go straight for the ‘altruistic’ arguments in all cases.

Jeroen Rombouts (Director at 3TU.Datacentrum): There are many messages for the open data public, but I hope that we can make it clear that we need to share if we all want to get more value out of data and take research to a higher level. And the people producing and sharing the data need to be rewarded for sharing – for that we need a revolution. So start publishing and citing data, and develop support and incentives!

You’ll undoubtedly have looked at the programme in preparation for IDCC. Which speakers/sessions are you most looking forward to?

Fiona Nielsen: I am looking forward to the C3 session on Data sharing and reuse. There are lots of tools and best practices for research data sharing being developed around the world, including the Dataverse project (one of the presentations in C3) and similar initiatives. I think it is crucial for the advancement of research that we learn as a community what approaches work to increase incentives for sharing and reuse, so that these approaches can be built into any and all new research data publication initiatives. 

Marta Hoffman-Sommer: What I’m especially looking forward to at the conference is to gain some new insights into the management of data in the long tail of science – mostly in the B1 session (ed. Big science and the long tail), but I expect this comes up in other sessions as well.

Phil Archer: That would be the sessions around citation and linking. I’ll also be interested to hear what Susan Halford has to say.

Thomas Ingraham: Regrettably, I can only be at the conference for the Tuesday – I am especially disappointed about missing all the interesting Wednesday sessions. Thankfully, I’ll be there for Barend Mons and Andrew Sallans, definitely looking forward to their talks!

Jeroen Rombouts: I hope to take home with me opportunities for long tail data and bright ideas on developing RDM services from B1 (ed. Big science and the long tail) and A3 (ed. Research data services).

Would you like to know more? The Open Data Panel will run on Tuesday from 11:30 to 12:30. If you haven’t registered yet for the conference, book your place now.