Open Data’s Dirty Little Secret

Earlier this week, I was very happy to take part in the Digital Science webinar on data management. I spoke about how data management should be accessible and understandable to all and not a barrier to research. I also made a small point, thrown in at the last minute, that really seemed to resonate with people: that open data has a dirty little secret.

The secret? Open data requires work.

In all of the advocacy for open data, we often forget to account for the fact that making data openly available is not as easy as flipping a switch. Data needs to be cleaned so that it doesn’t contain extraneous information, streamlined to make things computable, and documented so that another researcher can understand it. On top of this, you must choose a license and take time to upload the data and its corresponding metadata. One researcher estimated that this process required 10 hours for a recently published paper, with significantly more time spent preparing his code for sharing.

But there is another secret here. It’s that data management reduces this burden.

Managing your data well means that a good portion of the prep work is done by the time you go to make the data open. The tools are familiar ones: spreadsheet best practices, data dictionaries, README.txt files, and the like. Well-managed data is already streamlined and documented, and thus presents a lower barrier to making it open.
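To make the data dictionary idea a little more concrete, here is a minimal sketch in Python. It reads a CSV and drafts one starter entry per column, leaving the description and units to be filled in by hand. The function name `draft_data_dictionary` and the sample columns are made up for illustration; this is one possible starting point under those assumptions, not a standard tool or a prescribed workflow.

```python
import csv
import io


def draft_data_dictionary(csv_text):
    """Return one starter data-dictionary entry per column in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        # Keep only the cells that actually contain a value.
        values = [row[name] for row in rows if row[name] not in (None, "")]
        entries.append({
            "variable": name,
            "example": values[0] if values else "",
            "missing": len(rows) - len(values),
            "description": "TODO",  # written by the researcher, not inferred
            "units": "TODO",
        })
    return entries


# Hypothetical sample data, for illustration only.
sample = "site_id,temp_c,notes\nA1,21.5,\nB2,,sensor offline\n"

for entry in draft_data_dictionary(sample):
    print(entry)
```

The script only fills in what can be read off the file itself. The parts that make data understandable to another researcher, the descriptions and units, still have to come from you, which is exactly why it pays to write them down early.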

These issues are reinforced by the recently published “Concordat on Open Research Data”. The Concordat is made up of 10 principles, and these two in particular stuck out to me:

  • Principle 3: Open access to research data carries a significant cost, which should be respected by all parties.
  • Principle 6: Good data management is fundamental to all stages of the research process and should be established at the outset.

As we advocate for open data, Principle 3 reaffirms that we need to recognize the costs. But, as with most things I blog about here, there is a solution, and it’s managing your data better.