DOIs and the danger of data “quality”

I’ve just spent a moment looking at guidelines [PDF] from the UK’s Natural Environment Research Council (NERC) on how NERC-funded research can obtain a persistent identifier through the DOI® system.

Just DOI it, just don’t DOI it like that.

NERC have a data sharing policy, and fund data centres for sharing and long-term data preservation. Like us here at GESIS, they have an interest in promoting stable persistent identifiers (in both cases Digital Object Identifier (DOI) names) that allow datasets to be cited as one would a publication. All well and good.

I certainly have no issue with the advice they provide for researchers on obtaining a DOI name. It’s good, clear, and concise. However, I’m going to expand on my reaction to one line in their guidance document. NERC state “by assigning a DOI the [Environmental Data Centre] are giving it a ‘data center stamp of approval’”. Effectively they see a DOI name (or by implication any other form of Persistent Uniform Resource Locator (PURL)) as a quality check-mark in addition to its role as a reference to an object. Except the DOI system isn’t designed to suggest the “quality goes in before the name goes on”. Just to remind myself, I quickly looked at the International DOI Foundation handbook and it doesn’t mention data quality. Identification, yes. Resolution, yes. Management, yes. Quality, no.
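To make that concrete: resolution is purely mechanical. The doi.org proxy maps a DOI name to wherever the object currently lives, and nothing in that lookup encodes anything about quality. Here is a minimal sketch of how a DOI name becomes a resolvable URL (using 10.1000/182, the DOI name of the IDF’s own handbook, as the example):

```python
def doi_to_url(doi_name: str) -> str:
    """Build the resolver URL for a DOI name via the doi.org proxy.

    The proxy simply redirects to the current location registered for
    the object; the identifier itself carries no quality information.
    """
    return f"https://doi.org/{doi_name.strip()}"

# Example: the DOI name of the IDF's DOI Handbook.
print(doi_to_url("10.1000/182"))  # → https://doi.org/10.1000/182
```

Any string a registration agency is willing to register can sit on the right-hand side of that mapping; the mechanism is indifferent to what it points at.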

There is no standardized quality symbol for data themselves. Instead we have informal ones that act as proxies – properties that aren’t the researcher’s real concern but that correlate closely with the contestable idea of “quality”. But remember, they remain proxies, not the variable of interest. For example, just because a data set is available from a social science data archive doesn’t mean it is any good. It means the archive thinks people will use it (or that we are contractually obliged to take it), that it can be understood and isn’t just a set of numbers, that it doesn’t violate data protection laws or intellectual property rights, and that it doesn’t break our will or budget on its way into our collection. So, if you order data from a data archive it will be preserved and contextualized, and it is probably good-quality data – but it need not be. Indeed, I suspect most archives have a data set or two that somehow ended up accepted into the collection as the result of an impenetrable act of madness or despair. Likewise, receiving a DOI name might be a stamp of approval if minted by a NERC data centre, but other assigners might not be so fussed about the quality of what’s getting DOIed. As this blog post reminds us, anything can be given a DOI, multiple times.

Now, archives are working towards establishing their own stamps of approval for digital preservation and archiving. The Data Seal of Approval, the nestor Seal for Trustworthy Digital Archives, and ISO 16363 are recognized certification levels. Yet these are explicit symbols of quality in digital preservation – showing an archive knows what to do, how to do it, and is doing it. They indicate the quality of the curation, not the quality of the data being curated. The best preserved and contextualized data set in the world could still be junk.

So should we as a community be moving towards quality symbols for data themselves? The risk of starting down that route is encountering a host of problems defining contestable notions of “quality”. Digital preservation is, after all, measurable in the sense that something is either preserved and accessible or it isn’t. Research data, however, are subject to all kinds of challenges to their quality, even to the point of entire research approaches being dismissed. I have no problem with NERC specifying their own concept of quality (which they effectively do); my objection is to using DOI names as the tool to signify it. In short, we shouldn’t use a tool designed for one end to serve another.