Every day, people from around the world upload photos to share on a range of social media sites and web applications. The results are astounding; collections of billions of digital photographs are now stored and managed by several companies and organizations. In this context, Yahoo Labs recently announced that they were making a data set of 100 million Creative Commons photos from Flickr available to researchers. As part of our ongoing series of Insights Interviews, I am excited to discuss potential uses and implications for collecting and providing access to digital materials with David Ayman Shamma, a scientist and senior research manager with Yahoo Labs and Flickr.
Trevor: Could you give us a sense of the scope and range of this corpus of photos? What date ranges do they span? The kinds of devices they were taken on? Where they were taken? What kinds of information and metadata they come with? Really, anything you can offer for us to better get our heads around what exactly the dataset entails.
Ayman: There’s a lot to answer in that question. Starting at the beginning, Flickr was an early supporter of Creative Commons, and since 2004 devices have come and gone, photographic volume has increased, and interests have changed. When creating the large-scale dataset, we wanted to cast as wide and representative a net as possible, so the dataset is a fair random sample across the entire corpus of public CC images. The photos were uploaded from 2004 to early 2014 and were taken by over 27,000 devices, including everything from camera phones to DSLRs. The dataset is a list of photo IDs, each with a URL to download a JPEG or video, plus some corresponding metadata like tags, camera type and location coordinates. All of this data is public and can generally be accessed from an unauthenticated API call; what we’re providing is a consistent list of photos in a large, rolled-up format. We’ve rolled up some but not all of the data that is there. For example, about 48% of the dataset has longitude and latitude data, which is included in the rollup, but comments on the photos have not been included, though they can be queried through the API if someone wants to supplement their research with them.
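To make the shape of such a rollup concrete, here is a minimal Python sketch that reads tab-separated metadata records and measures what share of them carry coordinates. The column names and their order in `FIELDS`, the sample rows, and the `example.org` URLs are all assumptions made for illustration; the interview does not describe the actual file layout.

```python
import csv
import io

# Hypothetical column layout; the real rollup's field order is not described here.
FIELDS = ["photo_id", "device", "tags", "longitude", "latitude", "download_url"]


def geotagged_fraction(lines):
    """Return the share of records that carry both longitude and latitude."""
    reader = csv.DictReader(lines, fieldnames=FIELDS, delimiter="\t")
    total = 0
    tagged = 0
    for row in reader:
        total += 1
        # An empty string in either column is treated as "no location".
        if row["longitude"] and row["latitude"]:
            tagged += 1
    return tagged / total if total else 0.0


# Two made-up records: one with coordinates, one without.
SAMPLE = (
    "1\tCanon EOS\tbeach,sunset\t-122.41\t37.77\thttp://example.org/1.jpg\n"
    "2\tiPhone 4\tcat\t\t\thttp://example.org/2.jpg\n"
)
sample_fraction = geotagged_fraction(io.StringIO(SAMPLE))
```

Because the function consumes any iterable of lines, the same code could stream a multi-gigabyte file from disk without loading it into memory.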
Trevor: In the announcement about the dataset you mention that there is a 12 GB dataset, which seems to have some basic metadata about the images, and a 50 TB dataset containing the entirety of the collection of images. Could you tell us a bit about the value of each of these separately, the kinds of research each enables, and the kinds of infrastructure required to provide access to and process these datasets?
Ayman: Broadly speaking, research on Flickr can be categorized into two non-exclusive topic areas: social computing and computer vision. In the latter, one has to compute what are called ‘features’: pixel-level details about luminosity, texture, clustering and relations to other pixels. The same is true for audio in the videos. In effect, it’s a mathematical fingerprint of the media. Computing these fingerprints can take quite a bit of computational power and time, especially at the scale of 100 million items. While the core dataset of metadata is only 12 GB, a large collection of features reaches into the terabytes. Since these are all CC media files, we thought to also share these computed features. Our friends at the International Computer Science Institute and Lawrence Livermore National Laboratory were more than happy to compute and host a standard set of open features for the world to use. What’s nice is this expands the dataset’s utility. If you’re from an institution (academic or otherwise), computing the features yourself could cost a great deal of compute time.
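As a rough illustration of what a ‘feature’ can look like, the Python sketch below reduces an image to a normalized luminance histogram. This is a deliberately simple stand-in chosen for illustration, not one of the features computed for the dataset; the function name, the four-bin default and the tiny four-pixel image are all assumptions.

```python
def luminance_histogram(pixels, bins=4):
    """Normalized histogram of per-pixel luma for (r, g, b) tuples in 0..255."""
    counts = [0] * bins
    for r, g, b in pixels:
        # Rec. 601 luma weights turn an RGB triple into a single brightness value.
        luma = 0.299 * r + 0.587 * g + 0.114 * b
        # Map 0..255 onto a bin index, clamping the top edge into the last bin.
        index = min(int(luma * bins / 256), bins - 1)
        counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins


# A made-up four-pixel "image": black, white, pure red, pure green.
tiny_image = [(0, 0, 0), (255, 255, 255), (255, 0, 0), (0, 255, 0)]
fingerprint = luminance_histogram(tiny_image)
```

Even a feature this small has to touch every pixel, which hints at why computing richer features over 100 million items is expensive.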
Trevor: The dataset page notes that the dataset has been reviewed to meet “data protection standards, including strict controls on privacy.” Could you tell us a bit about what that means for a dataset like this?
Ayman: The images are all under one of six Creative Commons licenses implemented by Flickr. However, there were additional protections that we put into place. For example, you could upload an image with the license CC Attribution-NoDerivatives and mark it as private. Technically, the image is in the public CC; however, Flickr’s agreement with its users supersedes the CC distribution rights. With that, we only sampled from Flickr’s public collection. There are also some edge cases. Some photos are public and in the CC but the owner set the geo-metadata to private. Again, while the geo-data might be embedded in the original JPEG and is technically under CC license, we didn’t include it in the rollup.
Trevor: Looking at the Creative Commons page for Flickr, it would seem that this isn’t the full set of Creative Commons images. By my count, there are more than 300 million Creative Commons-licensed photos there. How were the 100 million selected, and what factors went into deciding to release a subset rather than the full corpus?
Ayman: We wanted to create a solid dataset given the potential public dataset size; 100 million seemed like a fair sample size that could bring in close to 50% geo-tagged data and about 800 thousand videos. We envision researchers from all over the world accessing this data, so we did want to account for the overall footprint and feature sizes. We’ve chatted about the possibility of ‘expansion packs’ down the road, both to increase the size of the dataset and to include things like comments or group memberships on the photos.
Trevor: These images are all already licensed for these kinds of uses, but I imagine that it would have simply been impractical for someone to collect this kind of data via the API. How does this data set extend what researchers could already do with these images based on their licenses? Researchers have already been using Flickr photos as data, what does bundling these up as a dataset do for enabling further or better research?
Ayman: Well, what’s been happening in the past is people have been harvesting the API or crawling the site. However, there are a few problems with these one-off research collections; the foremost is replication. By having a large and flexible corpus, we aim to set a baseline reference dataset for others to see if they can replicate or improve upon new methods and techniques. A few academic and industry players have created targeted datasets for research, such as ImageNet from Stanford or Yelp’s release of its Phoenix-area reviews. Yahoo Labs itself has released a few small targeted Flickr datasets in the past as well. But in today’s research world, the new paradigm and new research methods require large and diverse datasets, and this is a new dataset to meet the research demands.
Trevor: What kinds of research are you and your colleagues imagining folks will do with these photographs? I imagine a lot of computer science and social network research could make use of them. Are there other areas you imagine these being used in? It would be great if you could mention some examples of existing work that folks have done with Flickr photos to illustrate their potential use.
Ayman: Well, part of the exciting bit is finding new research questions. In one recent example, we began to examine the shape and structure of events through photos. Here, we needed to temporally align geo-referenced photos to see when and where a photo was taken. As it turns out, the time the photo was taken and the time reported by the GPS are off by as much as 10 minutes in 40% of the photos. So, in work that will be published later this year, we designed a method for correcting timestamps that are in disagreement with the GPS time. It’s not something we would have thought we’d encounter, but it’s an example of what makes a good research question. With a large corpus available to the research world at large, we look forward to others also finding new challenges, both immediate and far-reaching.
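One simple way to picture a timestamp correction is to estimate a single clock offset per camera from photos that carry both a camera time and a GPS time, then shift the camera clock by that amount. The Python sketch below does this with a median offset. It is an illustrative assumption only, not the method from the forthcoming work mentioned above; the function names and sample timestamps are made up, and all times are assumed to be in the same time zone.

```python
from datetime import datetime, timedelta
from statistics import median


def median_clock_offset(pairs):
    """Median of (gps_time - camera_time) over photos that carry both timestamps."""
    deltas = [(gps - cam).total_seconds() for cam, gps in pairs]
    return timedelta(seconds=median(deltas))


def correct(camera_time, offset):
    """Shift a camera timestamp by the estimated clock offset."""
    return camera_time + offset


# Made-up (camera_time, gps_time) pairs from one hypothetical camera.
pairs = [
    (datetime(2014, 1, 1, 12, 0, 0), datetime(2014, 1, 1, 12, 9, 30)),
    (datetime(2014, 1, 1, 12, 5, 0), datetime(2014, 1, 1, 12, 14, 40)),
    (datetime(2014, 1, 1, 12, 7, 0), datetime(2014, 1, 1, 12, 16, 20)),
]
offset = median_clock_offset(pairs)

# Apply the offset to a photo from the same camera that has no GPS time.
corrected = correct(datetime(2014, 1, 1, 13, 0, 0), offset)
```

The median is used rather than the mean so that a single wildly wrong reading does not drag the estimate.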
Trevor: Based on this, and similar webscope data sets, I would be curious for any thoughts and reflections you might offer for libraries, archives and museums looking at making large scale data sets like this available to researchers. Are there any lessons learned you can share with our community?
Ayman: There’s a fair bit of care and precaution that goes into making collections like this. Rarely is it ever just a scrape of public data; ownership and copyright do play a role. These datasets are large collections that reflect people’s practices, behavior and engagement with media like photos, tweets or reviews. So, coming to understand what these datasets mean with regard to culture is something to set our sights on. This applies to the libraries and archives that set out to preserve collections and to researchers and scientists, social and computational alike, who aim to understand them.