The European Commission recently held a public consultation on open research data. The President of DataCite, Adam Farquhar, had the opportunity to speak and highlighted the importance of identifying and citing research data. There were numerous DataCite members and data centres present.
Adam introduced the organization and highlighted how our data centres have assigned over 1.7 million DOIs, of which over 270 thousand were assigned this year.
Just a few years ago, Adam explained, identification was dominated by local, national, or disciplinary initiatives. It has now matured substantially with the growth of international cross-disciplinary organizations such as DataCite. In other areas, we are also seeing researcher identifiers in ORCID and article identifiers from CrossRef.
There is widespread consensus that identification and citation-level metadata are essential to making data accessible and re-usable, and to establishing incentive systems that encourage data sharing.
He also explained some lessons learned by DataCite over the last few years:
- Data identification requires interoperable APIs and metadata. We’ve worked with CrossRef to enhance the DOI APIs to support content negotiation for better machine-to-machine interactions. The DataCite Metadata Schema provides cross-disciplinary citation-level metadata for research data. It has been adopted by others, such as OpenAIRE, and supports third-party services.
- While data identification has some distinct requirements, it has been possible to enhance existing approaches, such as DOI, to meet them. For example, we’ve worked with IDF so that their business model is now better suited to research data requirements.
- Together, open infrastructure, metadata, and APIs enable third parties, including commercial organizations (e.g. Thomson Reuters) and publishers (e.g. Thieme and Elsevier), to build enhanced services. They also enable repositories like Dryad, Pangaea, FigShare, and Zenodo to ensure that the data they hold is identified and citable for the long term.
- Data identification is more than assigning a number. Success requires robust services, robust policies, and a strong community of practice. It also means that allocating agents and data stewards must establish formal, often contractual, relationships with the long term in mind. Without these essential steps, data identification becomes just another breeding ground for 404 errors – data not found.
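To make the content negotiation point concrete, here is a minimal Python sketch of how a client might ask a DOI resolver for machine-readable citation metadata instead of a landing page. It is an illustration under assumptions rather than a description of DataCite's own tooling: the resolver address `https://doi.org/`, the helper names `build_request` and `fetch_metadata`, and the placeholder DOI `10.1234/example-dataset` are all ours, and the CSL JSON media type is just one of several formats a resolver may offer.

```python
import json
import urllib.error
import urllib.request
from typing import Optional
from urllib.parse import quote

# Assumption: the public DOI resolver address; adjust if you use another proxy.
DOI_RESOLVER = "https://doi.org/"
# One commonly offered media type for citation metadata (CSL JSON).
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_request(doi: str, media_type: str = CSL_JSON) -> urllib.request.Request:
    """Build a request that asks for metadata through the Accept header."""
    url = DOI_RESOLVER + quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch_metadata(doi: str, timeout: float = 10.0) -> Optional[dict]:
    """Return parsed metadata for a DOI, or None if the resolver answers 404."""
    try:
        with urllib.request.urlopen(build_request(doi), timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as error:
        if error.code == 404:
            # The identifier does not resolve: "data not found".
            return None
        raise


# Hypothetical placeholder DOI, used only to show the request that would be sent.
example = build_request("10.1234/example-dataset")
print(example.full_url)
print(example.get_header("Accept"))
```

Calling `fetch_metadata` with a real DOI needs network access. The same request with a different `media_type` would ask for another representation of the same record, which is what makes the approach useful for machine-to-machine interactions.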
An open approach has also enabled us to work on the broader identifier challenges through collaborations. The DataCite-STM statement encouraged bi-directional links between articles and data. Through ODIN, the ORCID and DataCite Interoperability Network, we will learn how to link authors with their articles, data, and more.
And, while challenges remain, we have a very strong basis for an interoperable identification infrastructure, one that weaves data, articles, and researchers together into a new fabric of open research.
During the meeting there was strong consensus on the essential role that data identification and citation play, as well as on the need for data management plans. There were also some areas of disagreement. Some industry representatives argued against the need for data to be open or to have ‘open’ be the default setting. There was also robust discussion on the appropriate size of data repositories – some argued for large scale, others for many small ones.