Citation Omitted: A Story of Re-identification

I published the article “Data Management Practices in Academic Library Learning Analytics: A Critical Review” in 2019. Eagle-eyed readers may have noticed that I omitted a couple of citations, instead listing them as “citation omitted in order to protect students’ identities.” This is because the two studies in question published student details so identifying that it would have been possible to attach names to specific individuals. Of the two studies, one was incidentally identifying and one was egregiously identifying. This blog post is about the egregious one:

Murray, A., Ireland, A., & Hackathorn, J. (2016). The Value of Academic Libraries: Library Services as a Predictor of Student Retention. College & Research Libraries, 77(5), 631–642.

I had not publicly identified the study until January 6 of this year when, in my frustration over C&RL publishing yet another study with privacy concerns and the simultaneous unraveling of American democracy, I vented in this Twitter rant:

Then, to properly explain my concerns about privacy and re-identification in the article, I followed up with this Twitter thread:

I’ve had several requests to turn those two Twitter threads into a citeable blog post, so here we are.

I have two privacy concerns with the study. My major concern centers on Table 1, which lists study participants as including: 2 Native American freshmen, 3 Native American sophomores, 1 Pacific Islander freshman, and 1 Pacific Islander sophomore. The study also lists the participants’ ages as ranging from 17 to 83, meaning the oldest participant is 83 years old. By including this specific information, the article essentially identified several students even without giving us their names.

In research this is called “n=1,” meaning that you’ve divided the demographics so finely that you identify single people. It’s definitely not something that should be done when publishing research results. The individuals in this example are even more identifiable because they come from minority student populations (with examples of both racial and age minorities), so it’s bad on two fronts.

If I were part of the university where the study was conducted, just knowing “83-year-old student” or “Pacific Islander sophomore” might be enough for me to come up with specific names, because I’m familiar with the student body. Even as an outsider, it’s still a rather trivial process to go from n=1 identifiers to names.

Let’s take the “Pacific Islander sophomore” and work through the thought experiment (I’m not actually going to find a name, just describe the process). We’ll pull in an outside dataset to make this work, in this case IPEDS. IPEDS is a national database that collects statistics on every U.S. academic institution. One of the statistics IPEDS collects is completions per year in different majors, broken down by racial demographics, in the “Completions” table. So now I can look up the university, look up the year, and find my single Pacific Islander to discover their major. Then it becomes a matter of visiting the department webpage, Facebook, or the graduation program to get a list of names corresponding to that major in that year. Finally, using context and other available data, I can whittle the names down to a likely candidate. The person’s minority status makes them easier to identify here, especially if they have a non-White name or do not pass for White in departmental photos. This whole process might take 30 minutes or so and uses information that is freely available on the web.
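The first step of that process, scanning a published count table for cells that pin a statistic to a single person, is mechanical enough to sketch in a few lines of Python. The rows below are purely illustrative and do not use the actual IPEDS schema; any real Completions extract would have different field names and many more rows.

```python
# Hypothetical rows in the rough shape of an IPEDS "Completions" extract.
# The field names and values here are made up for illustration only.
completions = [
    {"year": 2015, "major": "Biology", "race": "Pacific Islander", "count": 1},
    {"year": 2015, "major": "Biology", "race": "White", "count": 212},
    {"year": 2015, "major": "History", "race": "Asian", "count": 7},
]

# Any cell with a count of exactly 1 ties the published statistic to a
# single, potentially re-identifiable person.
n1_cells = [row for row in completions if row["count"] == 1]
for row in n1_cells:
    print(f'n=1 cell: {row["race"]}, {row["major"]}, {row["year"]}')
```

The point is not that the code is clever; it’s that the search takes one line, which is why “only” publishing counts offers so little protection.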

Coming back to the study: while putting a name to a study participant does not tell me what that student did in the library, it’s still not acceptable for the article to identify them. It’s not okay that these issues slipped past peer reviewers and editors. And, when I contacted the editor about correcting the issue, it’s not okay that nothing was done (the conclusion was basically that it was bad, but not bad enough to merit correction).

So, seeing as this is a blog on better data practices, what should be done instead? Whenever you have small populations, think carefully before you report data about those people. There’s no hard rule of thumb for size, but consider: warnings at under 20, red flags at under 10, and a full stop at under 5. There are two common options for dealing with small populations: aggregate small subgroups into one “Other” group so the small numbers add up to a larger one (e.g., there are 33 Asian, Native American, and Pacific Islander students in this study); or obscure the small/outlier values (e.g., “<5 Pacific Islanders” or “>65 years old”). Be aware that the first option can hide the existence of minority racial populations by erasing their representation in the data, so be thoughtful in balancing representation with privacy.
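Both options above are easy to build into a reporting pipeline. The sketch below implements each one, with the threshold of 5 matching the “full stop” level from the rule of thumb; the demographic counts are invented for the example and are not from the study.

```python
def suppress_small_cells(counts, threshold=5):
    """Obscure small values: report any group under the threshold
    as "<threshold" instead of its true count."""
    return {
        group: (n if n >= threshold else f"<{threshold}")
        for group, n in counts.items()
    }

def aggregate_small_cells(counts, threshold=5, other_label="Other"):
    """Aggregate small values: fold every group under the threshold
    into a single combined bucket."""
    out, other_total = {}, 0
    for group, n in counts.items():
        if n >= threshold:
            out[group] = n
        else:
            other_total += n
    if other_total:
        out[other_label] = other_total
    return out

# Invented example counts, not the study's actual data.
demo = {"White": 412, "Asian": 29, "Native American": 3, "Pacific Islander": 1}
print(suppress_small_cells(demo))   # small groups reported as "<5"
print(aggregate_small_cells(demo))  # small groups folded into "Other"
```

Note the trade-off in the output: suppression keeps each group visible in the table while hiding its exact size, whereas aggregation protects individuals at the cost of erasing those groups’ names from the published results.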

The second thing that needs to be done is that we all need to be better at identifying and calling out these problems when we see them, especially peer reviewers and editors. I know I get on my soapbox periodically about “anonymization” versus “de-identification”, but it’s because many people fundamentally don’t understand the difference. We need to learn that datasets about people are never anonymous and that we should always operate from the perspective that they can be re-identified.

Finally, I won’t deny that there are a lot of power dynamics in play for why I haven’t told this story previously. I didn’t want to identify the article, and thereby identify the students, as the students have no power in this situation and didn’t ask to be identified just because they used the library. I was also leery, as a somewhat new librarian, of calling out one of the field’s preeminent journals. I have now done both because it’s important for people to understand just how easy it is to re-identify people from scant published information. I do this not to rehash the past but because I want people to do better going forward. So go, do better, and never publish n=1 again.

Thank you Dorothea and Callan and everyone else who suggested that this be a blog post.