Roadmap project IDCC briefing
We had a spectacularly productive IDCC last month thanks to everyone who participated in the various meetings and events focused on the DMPRoadmap project and machine-actionable DMPs. Thank you, thank you! Sarah has since …
Author: Digital Curation Centre blogs
Roadmap retrospective: 2016
- More publishers articulated clear data policies, e.g., Springer Nature Research Data Policies apply to over 600 journals.
- PLOS now requires an ORCID for all corresponding authors at the time of manuscript submission to promote discoverability and credit.
- The Gates Foundation reinforced its support for open access and open data by prohibiting funded researchers from publishing in journals that do not comply with its policy, which came into force at the beginning of 2017; this includes non-compliant high-impact journals such as Science, Nature, PNAS, and NEJM.
- Researchers throughout the world continued to circumvent subscription access to scholarly literature by using Sci-Hub (Bohannon, 2016).
- Library consortia in Germany and Taiwan canceled (or threatened to cancel) subscriptions to Elsevier journals because of open-access-related conflicts, and Peru canceled because government funding for expensive paid access had run out (Schiermeier and Rodríguez Mega, 2017).
- Reproducibility continued to gain prominence, e.g., the US National Institutes of Health (NIH) Policy on Rigor and Reproducibility came into force for most NIH and AHRQ grant proposals received in 2016.
- The Software Citation Principles (Smith et al., 2016) recognized software as an important product of modern research that needs to be managed alongside data and other outputs.
- Where and how do DMPs fit in the overall research lifecycle (i.e., beyond grant proposals)?
- Which data could be fed automatically from other systems into DMPs (or vice versa)? (One possibility is sketched after this list.)
- What information can be validated automatically?
- Which systems/services should connect with DMP tools?
- What are the priorities for integrations?
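To make the second question concrete, here is a minimal sketch of one way data might flow into a DMP from an external system: querying the Crossref Funder Registry (a real public API) so a DMP tool could validate or prefill a funder field. The surrounding DMP-tool code is an assumption for illustration, not taken from the project.

```python
import requests

def lookup_funder(name: str):
    """Query the Crossref Funder Registry so a DMP tool could
    validate or prefill a funder field automatically."""
    resp = requests.get("https://api.crossref.org/funders",
                        params={"query": name}, timeout=10)
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    # Return (name, persistent identifier) pairs for the top matches.
    return [(f["name"], f["uri"]) for f in items[:3]]

for funder_name, funder_id in lookup_funder("Wellcome Trust"):
    print(funder_name, "->", funder_id)
```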
RDMF16 Research Software Breakout Session
(Katie Fraser from the University of Nottingham reports on the Research Software Preservation breakout session at the RDMF16 event, which took place in Edinburgh in late November 2016…)
The Research Software breakout session was proposed and facilita…
Finding our Roadmap rhythm
In keeping with our monthly updates about the merged Roadmap platform, here’s the short and the long of what we’ve been up to lately courtesy of Stephanie Simms of the DMPTool:
Short update
- Co-development on Roadmap codebase (current sprint)
- Adding documentation to DMPRoadmap GitHub wiki
- Machine-actionable DMPs
- Substance Forms: new text editor
- Themes: still collecting feedback!
- Data Documentation Initiative working group for DMP vocabulary
- PIDapalooza: PIDs in DMPs (9-10 Nov, Reykjavik)
- IDCC17 paper and workshop proposals submitted (20-23 Feb, Edinburgh)
- Seeking funding to speed this work along
- Public DMPs: RIO Journal Collection and curating the DMPTool Public DMPs list
Long(er) update
This month our main focus has been on getting into a steady 2-week sprint groove that you can track on our GitHub Projects board. DCC/DMPonline is keen to migrate to the new codebase so in preparation we’re revising the database schema and optimizing the code. This clean-up work not only makes things easier for our core development team, but will facilitate community development efforts down the line. It also addresses some scalability issues that we encountered during a week of heavy use on the hosted instance of the Finnish DMPTuuli (thanks for the lessons learned, Finland!). We’ve also been evaluating dependencies and fixing all the bugs introduced by the recent Rails and Bootstrap migrations.
Once things are in good working order, DMPonline will complete their migration and we’ll shift focus to adding new features from the MVP roadmap. DMPTool won’t migrate to the new system until we’ve added everything on the list and conducted testing with our institutional partners from the steering committee. The CDL UX team is also helping us redesign some things, with particular attention to internationalization and improving accessibility for users with disabilities.
The rest of our activities revolve around gathering requirements and refining use cases for machine-actionable DMPs. This runs the gamut from big-picture brainstorming to targeted work on features that we'll implement in the new platform. The first step toward the latter involves a collaboration with Substance.io to implement a new text editor (Substance Forms). The new editor offers increased functionality, provides a framework for future work on machine-actionability, and delivers a better user experience throughout the platform. In addition, we're refining the DMPonline themes (details here); we're still collecting feedback and are grateful to all those who have weighed in so far. Sarah and I will consolidate community input and share the new set of themes during the first meeting of a DDI working group to create a DMP vocabulary. We plan to coordinate our work on the themes with this parallel effort—more details as things get moving on that front in November.
Future brainstorming events include PIDapalooza—come to Iceland and share your ideas about persistent identifiers in DMPs!—and the International Digital Curation Conference (IDCC) 2017 for which registration is now open. We’ll present a Roadmap update at IDCC along with a demo of the new system. In addition, we’re hosting an interactive workshop for developers et al. to help us envision (and plan for) a perfect DMP world with tools and services that support FAIR, machine-actionable DMPs (more details forthcoming).
Two final, related bits of info: 1) we’re still seeking funding to speed up progress toward building machine-actionable DMP infrastructure; we weren’t successful with our Open Science Prize application but are hoping for better news on an IMLS preliminary proposal (both available here). 2) We’re also continuing to promote greater openness with DMPs; one approach involves expanding the RIO Journal Collection of exemplary plans. Check out the latest plan from Ethan White that also lives on GitHub and send us your thoughts on DMP workflows, publishing and sharing DMPs.
Getting our ducks in a row
Recent activity on the Roadmap project encompasses two major themes: 1) machine-actionable data management plans and 2) kicking off co-development of the shared codebase.
Image credit: 'Get Your Ducks in a Row' CC BY-SA by Cliff Johnson
Machine-actionable DMPs
The first of these has been a hot topic of conversation among stakeholders in the data management game for some time now, although most use the phrase “machine-readable DMPs.” So what do we mean by machine-actionable DMPs? Per the Data Documentation Initiative definition, “this term refers to information that is structured in a consistent way so that machines can be programmed against the structure.” The goal of machine-actionable DMPs, then, is to better facilitate good data management and reuse practices (think FAIR: Findable, Accessible, Interoperable, Reusable) by enabling:
- Institutions to manage their data
- Funders to mine the DMPs they receive
- Infrastructure providers to plan their resources
- Researchers to discover data
This term is consistent with the Research Data Alliance Active DMPs Interest Group and the FORCE11 FAIR DMPs group mission statements, and it seems to capture what we’re all thinking: i.e., we want to move beyond static text files to a dynamic inventory of digital research methods, protocols, environments, software, articles, data… One reason for the DMPonline-DMPTool merger is to develop a core infrastructure for implementing use cases that make this possible. We still need a human-readable document with a narrative, but underneath the DMP could have more thematic richness with value for all stakeholders.
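As a rough illustration of "programming against the structure," here is a minimal Python sketch of a DMP fragment expressed as JSON. The field names are invented for this example rather than taken from any published schema, but they show how two of the stakeholder needs above could be served mechanically.

```python
import json

# A hypothetical machine-actionable DMP fragment; the field names are
# illustrative only, not drawn from a published DMP schema.
dmp_json = """
{
  "title": "Imaging mass spectrometry of plant tissue",
  "datasets": [
    {"name": "raw_images", "format": "TIFF", "size_gb": 500,
     "repository": "https://example.org/repo", "license": "CC-BY-4.0"},
    {"name": "processed_spectra", "format": "mzML", "size_gb": 40,
     "repository": "https://example.org/repo", "license": "CC0-1.0"}
  ]
}
"""

dmp = json.loads(dmp_json)

# Because the structure is consistent, a machine can act on it:
# an infrastructure provider could total the storage it needs to plan for...
total_gb = sum(ds["size_gb"] for ds in dmp["datasets"])
print(f"Projected storage for '{dmp['title']}': {total_gb} GB")

# ...and a funder could check that every dataset declares a license.
missing = [ds["name"] for ds in dmp["datasets"] if not ds.get("license")]
print("Datasets without a license:", missing or "none")
```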
A recent CERN/RDA workshop presented the perfect opportunity to consolidate our notes and ideas. In addition to the Roadmap project members, Daniel Mietchen (NIH) and Angus Whyte (DCC) participated in the exercise. We conducted a survey of previous work on the topic (we know we didn't capture everything, so please alert us to anything we missed) and began outlining concrete use cases for machine-actionable DMPs, which we plan to develop further through community engagement over the coming months. Another crucial piece of our presentation was a call to make DMPs public, open, discoverable resources. We highlighted existing efforts to promote public DMPs (e.g., the DMPTool Public DMPs list, publishing exemplary DMPs in RIO Journal), but these are just a drop in the bucket compared to what we might be able to do if all DMPs were open by default.
You can review our slides here. And please send feedback—we want to know what you think!
Let the co-development begin!
Now for the second news item: our ducks are all in a row and work is underway on the shared Roadmap codebase.
We open with a wistful farewell to Marta Ribeiro, who is moving on to an exciting new gig at the Urban Big Data Centre. DCC has hired two new developers to join our ranks—Ray Carrick and Jimmy Angelakos—both from our sister team at EDINA. The finalized co-development team has commenced weekly check-in calls, and in the next week or two we'll begin testing the draft co-development process by adding three features from the roadmap:
- Enhanced institutional branding
- Funder template export
- OAuth linking of ORCID iDs (sketched below)
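For the ORCID feature, the standard route is ORCID's three-legged OAuth flow. The sketch below shows the token exchange against ORCID's public endpoints; the client credentials, redirect URI, and surrounding application code are placeholders, not the Roadmap implementation.

```python
import requests

# Placeholders: a registered ORCID API client supplies these values.
CLIENT_ID = "APP-XXXXXXXX"
CLIENT_SECRET = "secret"
REDIRECT_URI = "https://dmptool.example.org/orcid/callback"

# Step 1: send the user to ORCID's authorization page.
authorize_url = (
    "https://orcid.org/oauth/authorize"
    f"?client_id={CLIENT_ID}&response_type=code"
    f"&scope=/authenticate&redirect_uri={REDIRECT_URI}"
)
print("Redirect user to:", authorize_url)

# Step 2: ORCID redirects back with ?code=...; exchange it for a token.
def exchange_code(code: str) -> dict:
    resp = requests.post(
        "https://orcid.org/oauth/token",
        headers={"Accept": "application/json"},
        data={"client_id": CLIENT_ID, "client_secret": CLIENT_SECRET,
              "grant_type": "authorization_code", "code": code,
              "redirect_uri": REDIRECT_URI},
        timeout=10,
    )
    resp.raise_for_status()
    token = resp.json()
    # The response includes the user's ORCID iD, which the tool would
    # store against the account to complete the link.
    return {"orcid": token["orcid"], "access_token": token["access_token"]}
```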
In the meantime, Brian is completing the migration to Rails 4.2 and both teams are getting our development environments in place. Our intention is to iterate on the process for a few sprints, iron out the kinks, and then use it and the roadmap as the touchstones for a monthly community developer check-in call. We hope this will provide a forum for sharing use cases and plans for future work (on all instances of the tool) in order to prioritize, coordinate, and alleviate duplication of effort.
The DCC interns have also been plugging away at their respective projects. Sam Rust has just finished building some APIs for creating plans and extracting guidance, and is now starting work on the statistics use case. Damodar Sójka, meanwhile, is completing the internationalization project, drawing on work done by the Canadian DMP Assistant team. We'll share more details about their work once we roll it back into the main codebase.
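The post does not document Sam's endpoints, so purely as an illustration, calls to a plan-creation and guidance API might look something like the following; every URL, field name, and credential here is hypothetical.

```python
import requests

# Hypothetical endpoints and payloads for illustration only; these are
# invented names, not the actual API described in the post.
BASE = "https://dmponline.example.org/api/v0"
HEADERS = {"Authorization": "Token abc123"}  # placeholder credential

# Create a plan from a funder template...
plan = requests.post(f"{BASE}/plans", headers=HEADERS, json={
    "template": "BBSRC Data Management Plan",
    "title": "Soil microbiome survey",
}, timeout=10).json()

# ...then pull the guidance attached to each question for reuse elsewhere.
guidance = requests.get(f"{BASE}/plans/{plan['id']}/guidance",
                        headers=HEADERS, timeout=10).json()
print(guidance)
```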
Next month the UC Berkeley Web Services team will evaluate the current version of DMPonline to flag any accessibility issues that need to be addressed in the new system. We’ve also been consulting with Rachael Hu on UX strategy. We’re keeping track of requests for the new system and invite you to submit feedback via GitHub issues.
Stay tuned to GitHub and our blog channels for more documentation and regular progress updates.
#IDCC16: Atomising data: Rethinking data use in the age of explicitome
Data re-use is an elixir for those involved in research data.
Make the data available, add rich metadata, and then users will download the spreadsheets, databases, and images. The archive will be visited, making librarians happy. Datasets will be cited, making researchers happy. Datasets may even be re-used by the private sector, making university deans even happier.
But it seems to me that data re-use, or at least the particular conceptualisation of re-use established in most data repositories, is not the definitive way of conceiving of data in the 21st century.
Two great examples from the International Data Curation Conference illustrated this.
Barend Mons declared that the real scientific value in scholarly communication lies not in abstracts, articles, or supplementary information. Rather, the data that sits behind these outputs is the real oil to be exploited: millions of assertions about all kinds of biological entities.
Mons described the sum of these assertions as the explicitome, which enables cross-fertilisation between distinct pieces of scientific work. With all experimental data made available in the explicitome, researchers taking an aerial view can suddenly see all kinds of new connections and patterns between entities cited in wholly different research projects.
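One established way to make such assertions machine-tractable is to express them as RDF triples, as nanopublications do. Below is a toy Python sketch (using rdflib, with invented example identifiers) of how assertions drawn from different papers can be chained to surface a connection that neither paper states on its own.

```python
from rdflib import Graph, Namespace

# Invented example namespace and identifiers, for illustration only.
EX = Namespace("https://example.org/bio/")

g = Graph()
# Each assertion from a paper becomes one explicit, queryable statement.
g.add((EX.GeneA, EX.upregulates, EX.ProteinB))   # from paper 1
g.add((EX.ProteinB, EX.bindsTo, EX.CompoundC))   # from paper 2

# A machine taking the "aerial view" can chain assertions from
# different projects to reveal a new connection.
for gene, _, protein in g.triples((None, EX.upregulates, None)):
    for _, _, compound in g.triples((protein, EX.bindsTo, None)):
        print(f"{gene} may influence {compound} via {protein}")
```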
The second example came from Eric Kansa's talk on the Open Context framework for publishing archaeological data. Following the same principle as Barend Mons, Open Context breaks data down into individual items. Instead of downloading a whole spreadsheet relating to a single excavation, you can access individual bits of data. From an excavation, you can see the data related to a particular trench, and then the items discovered in that trench.
A screenshot from Open Context
In both cases, data re-use is promoted, but in an entirely different way to datasets being uploaded to an archive and then downloaded by a re-user.
In the model proposed by Mons and Kansa, data is atomised and then published. Each individual item, or each individual assertion, gets its own identity. And that piece of data can then easily be linked to other relevant pieces of data.
This hugely increases the chance of data re-use; not of whole datasets, of course, but of tiny fractions of datasets. An archaeologist examining the remains of jars on French archaeological sites might not even think to look at a dataset from a Turkish excavation. But if the latter dataset is atomised in a way that identifies the presence of jars as well, then that element of the Turkish dataset suddenly becomes useful.
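A toy Python sketch of the same idea: once each find is its own identified record, a single query cuts across excavations that would otherwise sit in separate downloads. The identifiers, sites, and fields below are invented for illustration.

```python
# Atomisation in miniature: every find is an identified record.
records = [
    {"id": "fr-trench2-0041", "site": "Lattes, France", "type": "jar"},
    {"id": "fr-trench2-0042", "site": "Lattes, France", "type": "coin"},
    {"id": "tr-trench7-0193", "site": "Kenan Tepe, Turkey", "type": "jar"},
]

# Whole-dataset downloads would hide the Turkish jar from a researcher
# browsing French material; item-level records expose it directly.
jars = [r for r in records if r["type"] == "jar"]
for r in jars:
    print(r["id"], "-", r["site"])
```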
This approach poses the big challenge for those charged with archiving such data. Many data repositories, particularly institutional ones, store individual files but not individual pieces of data. How research data managers begin to cope with the explicitome – enabling it, nourishing it, and sustaining it – may well be a topic of interest for IDCC17.
#IDCC16: Strategies and tactics in changing behaviour around research data
The International Data Curation Conference (IDCC) continues to be about change. That is, how do we change the ecosystem so that managing data is an essential component of the research lifecycle? How can we free the rich data trapped in PDFs or lost to linkrot? How can we get researchers to data mine and not data whine?
While, for some, the pace of change is not quick enough, IDCC still demonstrates an impressive breadth of strategy and tactics to enable this change.
On the first day of the conference, Barend Mons set out the vision. The value of research is not in journals but in the underlying data – thousands and thousands of assertions about genes, bacteria, viruses, proteins, indeed any biological entity are locked in figures and tables. Release such data and the interconnections between related entities in different datasets reveals whole new patterns. How to make this happen? One part of the solution: all projects should allocate 5% of their budget to data stewardship.
Andrew Sallans of the Center for Open Science followed this up with the Center's eponymous platform, the Open Science Framework, which links data to all kinds of cloud providers and (fingers crossed) institutional data repositories. In large-scale projects, sharing and versioning data can easily get out of control; the framework helps to manage this process more easily. They have some pretty nifty financial incentives to change practice too: $1,000 awards for pre-registration of research plans.
Following this we saw many posters – tactics to alter behaviours of individuals and groups of researchers. There were some great ideas here, such as plans at the University of Toronto to develop packages of information for librarians on data requirements of different disciplines.
Despite this, my principal concern was the huge gap between the massive sweep of the strategic visions and the tactics for implementing change. Many of the posters were valiant but were locked in an institutional setting – libraries wrestling with how to influence faculty without the in-depth knowledge (or institutional clout) to make winning arguments within a particular discipline.
What still seems to be missing from IDCC is the disciplinary voice. How are particular subjects approaching research data? How can the existing community work more closely with them? There was one excellent presentation on building workflows for physicists studying gravitational waves, and others reported results from OCLC's work with social scientists and zoologists. But in most cases it was us librarians doing the talking rather than it being a shared platform with the researchers. If we want that change to happen, there still needs to be greater engagement with the subjects that are creating the research data in the first place.
RDMF14: report from Breakout Group 2 (Systems Integration)
This breakout group was a discussion on the challenges of integrating systems for research data management. It was chaired by Rory McNicholl.
The group was asked to give examples of systems that could be integrated with research data infrastructure.
So…
Where are they now? An RDM update from the University of Glasgow
A guest blog post by Mary Donaldson, Research Data Management Services Co-ordinator, University of Glasgow.
Over recent years, central support for research data management (RDM) at the University of Glasgow has been limited. The Jisc-funded C4D project, which ran until September 2013, provided some basic support, which was augmented with expert advice from the Digital Curation Centre (DCC). We used our DCC institutional engagement to assist with the formulation of our draft Institutional Data Policy and our Engineering and Physical Sciences Research Council (EPSRC) Roadmap, and for help with RDM training. The DCC also helped run a Data Asset Framework (DAF) survey and associated follow-up interviews, which allowed us to assess current RDM practices in the University. Joy Davidson and Sarah Jones provided invaluable support in the early development of RDM awareness at Glasgow.
Between the end of the initial formal engagement with the DCC and late 2014, work to develop and promote RDM at Glasgow proceeded on an ad hoc basis. In late 2014, the University of Glasgow began appointing an RDM team to run its institutional RDM service and research data registry and repository. The team currently comprises an RDM officer with responsibility for the technical side of operations and an RDM officer with responsibility for the coordination of the service. In June 2015, our team will be complete when a third member, an RDM officer with responsibility for staff training and support, joins us. The RDM team has been working systematically to develop the RDM service on several fronts:
Registry and Repository:
As part of the Cerif for Datasets (C4D) project, Glasgow set up a fledgling Research Data Registry (http://researchdata.gla.ac.uk) using EPrints repository software. The Registry uses a metadata specification developed during the C4D project in collaboration with other EPrints sites to agree on standard functionality. It offers various functions to help researchers manage their research data, including the capability to mint Digital Object Identifiers (DOIs) for data.
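EPrints handles DOI minting through a plugin, but as a rough illustration of what sits underneath, registering a DOI with the DataCite MDS API is a two-step exchange along the following lines. The credentials, DOI, landing page, and metadata are all placeholders; a real call needs a datacentre account and valid DataCite XML.

```python
import requests

MDS = "https://mds.datacite.org"
AUTH = ("GLA.DATACENTRE", "password")     # placeholder datacentre account
DOI = "10.5072/example-dataset-1"         # 10.5072 is DataCite's test prefix
METADATA = b"<resource>...</resource>"    # abridged; must be valid DataCite XML

# Step 1: deposit the metadata record for the DOI.
r = requests.post(f"{MDS}/metadata", data=METADATA, auth=AUTH,
                  headers={"Content-Type": "application/xml;charset=UTF-8"},
                  timeout=10)
r.raise_for_status()

# Step 2: register the DOI and point it at the dataset's landing page.
r = requests.post(f"{MDS}/doi", auth=AUTH,
                  data=f"doi={DOI}\nurl=http://researchdata.gla.ac.uk/id/eprint/1",
                  headers={"Content-Type": "text/plain;charset=UTF-8"},
                  timeout=10)
r.raise_for_status()
```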
Following the appointment of the Glasgow RDM team in late 2014, the Registry has been augmented with data archiving capability provided by Arkivum, using the Arkivum EPrints plugin to link the Registry seamlessly with off-site Arkivum storage.
Plans for the coming months include: linking research data with research publications and theses, building on work carried out by the University of London Computer Centre and the University of East London; linking research data with University staff profile pages; an enhanced Registry front end with a responsive design so that researchers can use the Registry across multiple devices; and iterative development of data ingest and curation procedures as more datasets come through the Registry and workflows are tested and revised.
Researcher Engagement:
With the 1 May 2015 deadline for the EPSRC expectations looming, the majority of our researcher engagement activities this year have focussed on EPSRC-funded researchers. We have been contacting EPSRC-funded researchers to offer face-to-face meetings with a member of the RDM team to clarify funder expectations and to explain how the RDM service can help. Through these meetings, we have identified a few examples of really good practice and have cultivated these researchers as 'data management champions' who are willing to speak at RDM engagement events about how they go about RDM activities. We also take opportunities to speak at researcher gatherings to raise the profile of RDM within the University's research community. In addition to our proactive work with EPSRC-funded researchers, our services are also available to all other research staff in the University. In recent weeks, with the release of the ESRC Data Policy, we have been looking at ways to engage with ESRC-funded researchers at Glasgow. We anticipate that as compliance with the EPSRC and ESRC requirements becomes part of the normal research workflow, we'll turn our attention to other RCUK-funded researchers. We are also working with the Open Access Service to coordinate our service offerings and to reduce the number of emails received by research staff.
Training Offering:
Recently we have been working on extending our researcher training offering to make sure we cover all aspects of RDM and the data lifecycle, and make this training available to researchers at all stages of their careers.
Through the Staff Development Service, we will be increasing the number of opportunities to attend the existing workshop, 'Managing Research Data', and we will also be offering a new workshop, 'Data Management Planning'. We will also contribute appropriate material to several other workshops run by the Staff Development Service.
Through the Graduate Schools, we will be offering workshops on Research Data Management for Postgraduate researchers. We will also be contributing appropriate material to other training courses offered by the Graduate Schools.
In addition we will be delivering training to staff and student groups within the University such as the Early Career Researcher Fellowship Application Mentoring Group.
Service coordination:
We are continuing to work with other services at Glasgow to ensure that consideration is given to RDM at the appropriate times in a research project lifecycle. With the contracts team, we have agreed wording for collaboration agreements that makes provision for data sharing at the end of a project. We are also working with the University Ethics Committee to inform them of RDM considerations they might need to take into account when reviewing applications, and with the Research Support Office to get researchers to complete a data management plan for each project and to cost RDM into funding bids.
We have also made two successful proposals to the Research Strategy and Planning Committee:
- To place responsibility for the quality assurance of our data curation processes within the remit of our Vice Principal for Research and Enterprise.
- To strongly encourage all researchers within the University of Glasgow to prepare data management plans for their projects, regardless of whether this is required by their funders as part of the application process.
Future aspirations:
- To ensure all researchers have access to support that facilitates compliance with funder requirements and good data management practice.
- To extend the University of Glasgow-specific guidance in DMPonline.
- To fully embed data management and planning into the normal workflow of researchers at Glasgow.
- To update our training offering and resources with examples of best RDM practice from within our own research community.
Confessions of a Digital Archivist
I have a confession to make: I'm not from a university, and I don't have any research data management experience. So why exactly did I attend IDCC this year?