Making Dryad More Data Science Friendly

Daniella Lowenberg & Karthik Ram

As we enter year three of the pandemic, it has become clear that many aspects of our lives have permanently changed. Travel and fieldwork, especially in remote locations, were never easy to begin with. Now, these efforts have become much more challenging to organize and execute, serving as a constant reminder that the data we collect must be carefully curated and reliably made available to future researchers. 

The Dryad team has been making steady improvements to platform infrastructure for many years, beginning with the CDL partnership and platform re-launch in 2019. Through a series of outdoor meetings during lockdowns, we explored ways to make Dryad even more researcher friendly, especially in the context of data quality and data reuse. Recent Dryad integrations have focused heavily on submission, in line with our goals of increasing awareness and the feasibility of publishing data: publisher integrations, integration with Zenodo for software and supplementary information, and tabular data checks with Frictionless Data. These integrations have been powerful and necessary for supporting research data publishing. Now, it’s time to focus that level of investment on researcher reuse of Dryad datasets.

In Q2 of 2022, we carried out a detailed analysis of the Dryad corpus and the API. Dryad hosts more than a million data files across over 48,000 data publications. Tabular data files (CSV, TSV, and Excel) make up at least 30% of submissions (far more are tucked inside compressed files), followed by various image formats and miscellaneous supporting files (scripts, notes, and README files). At least 13% of files are opaque zip archives that contain collections of tabular or FASTA files. Usage instructions were sparse, and README files have historically been poorly structured.
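
For readers who want to poke at the corpus themselves, the public Dryad API makes this kind of tally straightforward. The sketch below counts file extensions for one page of recent datasets; the endpoint paths and HAL-style response keys (such as "stash:datasets") are assumptions based on the public API documentation rather than the exact queries we ran.

```python
import requests
from collections import Counter

BASE = "https://datadryad.org/api/v2"

# Fetch one page of datasets; the API is paginated via page/per_page.
page = requests.get(f"{BASE}/datasets", params={"per_page": 20}).json()

extension_counts = Counter()
for dataset in page.get("_embedded", {}).get("stash:datasets", []):
    # Each dataset links to its latest version, which in turn lists its files.
    version_path = dataset["_links"]["stash:version"]["href"]
    files = requests.get(f"https://datadryad.org{version_path}/files").json()
    for f in files.get("_embedded", {}).get("stash:files", []):
        name = f.get("path", "")
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else "(none)"
        extension_counts[ext] += 1

print(extension_counts.most_common(10))
```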

In 2021, Dryad partnered with the Frictionless project to run data validation across all new submissions. An analysis of 46,823 tabular files revealed that 85% had no obvious validation issues, 10% had problems, and 4% had more serious errors. Dryad continues to run Frictionless validation during the submission process but doesn’t yet enforce compliance before submission.
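
The same kind of check is easy to reproduce locally with the open-source frictionless Python package. Here is a minimal sketch, where "my_measurements.csv" is a placeholder for any tabular file:

```python
# pip install frictionless
from frictionless import validate

# Validate a tabular file for structural problems such as blank headers,
# ragged rows, or inconsistent types. The file name is a placeholder.
report = validate("my_measurements.csv")

if report.valid:
    print("No obvious validation issues found.")
else:
    # Each task corresponds to one validated table.
    for task in report.tasks:
        for error in task.errors:
            print(error.message)
```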

From these results, and from listening to various research communities, it’s clear that for any data publisher, and especially for Dryad, the value needs to lie in the usability of published datasets. Dryad has put a plan in place to improve data quality at submission, the point at which researchers are best equipped to address problems with their datasets. We have also set plans in motion for substantial changes to the API and the interface. In the future, we will explore feature sets around file manifests, tabular file previews, rendered READMEs, README templates, and much more.
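
As a rough illustration of what a manifest and preview could offer reusers, the snippet below downloads a dataset archive through the current API and peeks at its contents; the DOI is a placeholder, and the download endpoint is an assumption based on the public API documentation.

```python
import io
import zipfile
import requests
import pandas as pd

BASE = "https://datadryad.org/api/v2"
doi = "doi:10.5061/dryad.example"  # placeholder DOI

# Download the full archive for one data publication.
encoded = requests.utils.quote(doi, safe="")
resp = requests.get(f"{BASE}/datasets/{encoded}/download")
archive = zipfile.ZipFile(io.BytesIO(resp.content))

# A simple file manifest: names and uncompressed sizes.
for info in archive.infolist():
    print(info.filename, info.file_size)

# Preview the first few rows of the first CSV in the archive.
csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
if csv_names:
    with archive.open(csv_names[0]) as handle:
        print(pd.read_csv(handle, nrows=5))
```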

The last decade has proved that it’s possible to get researchers to comply with open data policies en masse: tossing their data over a wall to the repository, including a data availability statement (rarely with a data citation, a subject of many of Daniella’s rants, some of which are available here), and feeling like they’ve met the mandate. But at what point is this useful? It isn’t, if the data aren’t being reused, and especially if the data can’t be reused.

Dryad’s mission remains to advance scientific discovery through curated, open data access. To drive this forward, we will focus on feature sets centered on reusability, machine usability, and pluggability. This includes aligning with popular data science tools, guiding researchers through the submission process with more thorough checks and automated quality tooling, and rethinking how users access and compute with data published in Dryad.

As executable notebooks become more mainstream in the research community, Dryad is committed to meeting these researchers where they are headed, with a data-science-friendly research repository.