Self-archiving platforms and data verification

There used to be a comedy show on TV that featured a character who described everything as either “brilliant!” or “fantastic!” Isn’t open data brilliant! Data sharing, brilliant! Expanding ways to facilitate open data and sharing, fantastic! And, you know what, it is! Transparency is brilliant, accountability is fantastic, and advancing scientific knowledge is, yes, fantastic and brilliant!

Automated data sharing platforms, or self-archiving platforms, that make it as easy as possible for researchers to share their data through a Dropbox-, YouTube-, or Flickr-like interface are fantastic facilitators of sharing: they provide a platform for discovery and get data to their community quickly and easily, in a world where “discovery”, “quick” and “easy” are essential elements. For the most part this is not a problem and is, well, brilliant, or even well brilliant.

But (and of course there’s a “but”, as I have another 800 words for you) the fantastic move to open data and the exchange of data is occurring at a time of increasing fear about sharing personal data. What is personal data? It is data relating to a living individual who can be identified either from those data alone or from those data in combination with other available information.
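To make that “in combination” clause concrete, here is a toy sketch in Python. Every record and name in it is invented, and it is only an illustration of the principle, not any real dataset: a file with names removed can still identify a living individual once it is joined to other available information on a few quasi-identifiers.

```python
# "De-identified" survey: names stripped, but quasi-identifiers kept.
survey = [
    {"postcode": "LS2 9JT", "birth_year": 1978, "sex": "F", "income": 52000},
    {"postcode": "LS2 9JT", "birth_year": 1991, "sex": "M", "income": 31000},
]

# Other available information, e.g. an electoral roll or staff directory.
register = [
    {"name": "Jane Doe", "postcode": "LS2 9JT", "birth_year": 1978, "sex": "F"},
]

def reidentify(survey, register):
    # Match "anonymous" records to named ones on quasi-identifiers alone.
    for s in survey:
        for p in register:
            if all(s[k] == p[k] for k in ("postcode", "birth_year", "sex")):
                yield p["name"], s["income"]

print(list(reidentify(survey, register)))
# -> [('Jane Doe', 52000)]: the de-identified record is now personal data
```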

Data protection laws tend to get stronger rather than weaker, the grumbling about the data social networks require and what they do with the data we give them gets louder, and of course the revelations as to just how much governments (not even our own) pry into our lives generate increasing outrage. Not so brilliant. Certainly not fantastic.

So, what happens when our desire to share scientific data quickly and easily comes up against a need to protect personal data?

Here’s an example based, as they say in Hollywood, on a true story. A self-archiving platform is launched, welcoming research outputs from across the sciences. It’s established on the principle of making it as easy as possible for researchers to share their data, through an easy-to-use interface and quick uploading of data in almost any file format. From an IT perspective, it’s brilliant! Anything that makes research data management and sharing less of a chore must be attractive, right? Especially when the alternative is that data is lost forever. However, social science archiving nearly always has a problem with the “quick” and “easy”. Compared to other data sources, we are rightly neither quick nor easy when it comes to research based on human subjects that could contain personal data.

Archives, like self-archiving platforms, take data on a basis of trust. We trust that the people offering data are telling the truth when they claim they have the right to offer it, and we trust researchers when they tell us anonymisation and personal data issues have been addressed to a mutually agreed standard. Occasionally, through naiveté or a genuine mistake, researchers may give us data that violates that trust. However, an archive, unlike most self-archiving platforms, will have a data ingest procedure that includes a manual data verification process and safeguards to ensure personal data laws are respected and re-identification of participants is prevented, to the point where a “disproportionate amount of time, expense and effort” would be required [Germany] or “identification is not likely to take place” [UK]. Now, we often don’t do the work on personal data and anonymisation ourselves, but we do a lot of manual work making sure researchers have done what they said they did, and that it is addressed before we make data available to the research community or others.
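For illustration, here is a minimal sketch of the kind of automated first pass that might sit in front of that manual verification. The patterns, paths and thresholds are all invented for the example, and deliberately crude: a screen like this can flag suspected direct identifiers for a curator, but it complements rather than replaces a human reading the data.

```python
import re
from pathlib import Path

# Illustrative patterns for direct identifiers; a real screen would be broader.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "uk_phone": re.compile(r"(?:\+44|0)\d{9,10}\b"),
    "uk_nino": re.compile(r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b"),
}

def screen(path: Path) -> dict:
    """Count suspected direct identifiers in a deposited text file."""
    text = path.read_text(errors="ignore")
    return {name: len(rx.findall(text)) for name, rx in PATTERNS.items()}

# Hold anything with a hit back for the curator rather than publishing it.
for deposit in Path("incoming").glob("*.csv"):
    hits = screen(deposit)
    if any(hits.values()):
        print(f"HOLD {deposit.name} for manual review: {hits}")
```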

However, if that ingest and verification process is automated and supported by a loose policy on data acquisition and ingest, these problems with personal data and identification only get caught once they are realized rather than while they are still hypothetical. If you accept almost any file format, that makes it harder to verify the contents. If you promise instant data availability, that makes it impossible to verify that anonymisation and data checking have taken place. So if something sneaks through that shouldn’t, it doesn’t matter that you take it offline immediately. Like a misjudged celebrity tweet, the fact it appeared at all is enough to do damage.
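In code terms, the policy difference might look something like the hypothetical sketch below. It is not any particular platform’s implementation, just the shape of the safeguard: deposits land in quarantine, formats we cannot read cannot be verified, and nothing becomes public until a person has signed off. “Publish” is never the default.

```python
from enum import Enum

class State(Enum):
    QUARANTINED = "quarantined"   # uploaded, not visible to anyone
    VERIFIED = "verified"         # curator confirmed anonymisation checks
    PUBLISHED = "published"       # visible to the research community

ACCEPTED_FORMATS = {".csv", ".tsv", ".sav", ".dta"}  # formats we can read

class Deposit:
    def __init__(self, filename: str):
        self.filename = filename
        self.state = State.QUARANTINED
        self.verified_by = None

    def verify(self, curator: str) -> None:
        ext = "." + self.filename.rsplit(".", 1)[-1].lower()
        if ext not in ACCEPTED_FORMATS:
            raise ValueError(f"{ext}: cannot verify contents we cannot read")
        self.verified_by = curator    # only a human sign-off gets us here
        self.state = State.VERIFIED

    def publish(self) -> None:
        if self.state is not State.VERIFIED:
            raise RuntimeError("refusing to publish unverified data")
        self.state = State.PUBLISHED
```

The design choice the sketch encodes is the one the episode above was missing: availability is gated on verification, rather than verification being something you attempt after the data is already public.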

This is bad for two reasons. There is the obvious legal issue and the potential punishments, which can be severe. But the other reason is trust. Trust is our currency. This archive thing of ours only works, and can only work, on a basis of trust. In fact, the social science archiving community has even adopted standards to support the value of trust. We have to trust what researchers give us; they have to trust that we can look after it. Users have to trust the quality and contents of what we give them; we have to trust that users respect the terms of use under which we provide them data. Trust is precious. A violation of that trust, wilful or not, leads to a devaluation in the currency of trust, and that is neither brilliant nor fantastic. It is awful.

Data infrastructures have a lot in common, but what this episode suggests is that data archiving and sharing isn’t just an IT issue. Sure, we need platforms that are easy to use and that get data to people as quickly as possible. However, we need policies, procedures and expertise underpinning those platforms. We can’t just take anything, take it all, let it sort itself out, and act only when a problem is pointed out by the depositor or user community. Trying to retrospectively moderate data you may not be able to read or understand is not a “policy”; it’s an invitation for trouble. At some point, you still need the human touch.