The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture last week. This is part of an ongoing series of interviews Julia is conducting to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.
What can a Yelp review or a single tweet reveal about society? How about hundreds of thousands of them? In this installment of the Insights Interviews series, I’m thrilled to talk with researcher Bryan Routledge about two of his projects that use a computational-linguistic lens to analyze vast quantities of social media data. You can read the article on word choice in online restaurant reviews here, and the article comparing Twitter with traditional public opinion polls as a predictive tool here (PDF).
Julia: The research group Noah’s ARK at the Language Technologies Institute, School of Computer Science at Carnegie Mellon University aims in part to “analyze the textual content of social media, including Twitter and blogs, as data that reveal political, linguistic, and economic phenomena in society.” Can you unpack this a bit for us? What kind of information can social media provide that other kinds of data can’t?
Bryan: Noah Smith, my colleague in the School of Computer Science at CMU, runs that lab. He is kind enough to let me hang out over there. The research we are working on looks at the connection between text and social science (e.g., economics, finance). The idea is that looking at text through the lens of a forecasting problem — the statistical model between text and some measured social-science variable — gives insight into both the language and the social parts. Online and easily accessed text brings new data to old questions in economics. More interesting, at least to me, is that grounding the text/language with quantitative external measures (volatility, citations, etc.) gives insight into the text. What words in corporate 10-K annual reports correlate with stock volatility, and how that changes over time, is cool.
Julia: Your work with social media—Yelp and Twitter—is notable for its large sample sizes and emphasis on quantitative methods, using over 900,000 Yelp reviews and 1 billion tweets. How might archivists of social media better serve social science research that depends on these sorts of data sets and methods?
Bryan: That is a good question. What makes it very hard for archivists is that collecting the right data without knowing the research questions is hard. The usual answer of “keep everything!” is impractical. Google’s n-gram project is a good illustration. They summarized a huge volume of books with word counts (two-word pairs, and so on) over time. This is great for some research, but not for the more recent statistical models that use sentence- and paragraph-level information.
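The n-gram summarization Bryan mentions can be sketched in a few lines. This is a hypothetical illustration of counting adjacent word pairs in a text, not Google’s actual pipeline:

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count n-grams (here, pairs of adjacent words) in a text."""
    words = text.lower().split()
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

counts = ngram_counts("to be or not to be")
print(counts["to be"])  # → 2 (the pair "to be" appears twice)
```

A summary like this is compact and easy to aggregate by year, but, as Bryan notes, it throws away the sentence and paragraph structure that newer models rely on.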
Julia: Your background and most of your work is in the field of finance, which you have characterized as being fundamentally about predicting the behavior of people. How do you see financial research being influenced by social media and other born-digital content? Could you tell us a bit about what it means to have a financial background doing this kind of research? What can the fields of finance and archives learn from each other?
Bryan: Finance (and economics) is about the collective behavior of large numbers of people in markets. To make research possible you need simple models of individuals. Getting the right mix of simplicity and realism is age-old and ongoing research in the area. More data helps. Macroeconomic data like GDP and stock returns are informative about the aggregate. Data on, say, individual portfolio choices in 401(k) plans let you refine models. Social media data is this sort of disaggregated data. We can get a signal, very noisy, about what is behind an individual decision. Whether that is ultimately helpful for guiding financial or economic policy is an open, but exciting, question.
More generally, working across disciplines is interesting and fun. It is not always “additive.” The research we have done on menus has nothing to do with finance (other than my observation that in NY restaurants near Wall Street, the word “baby” is associated with expensive menu items). But if we can combine, for example, decision theory finance with generative text models, we get some cool insights into purposefully drafted documents.
Julia: The data your team collected from Yelp was gathered from the site. Your data from Twitter was collected using Twitter’s Streaming API and “Gardenhose,” which deliver a random sampling of tweets in real-time. I’d be curious to hear what role you think content holders like Yelp or Twitter can or could play in providing access to this kind of raw data.
Bryan: As a researcher with only the interests of science at heart, it would be best if they just gave me access to all their data! Given that much of the data is valuable to the companies (and there is privacy to consider, of course), I understand that is not possible. But it is interesting that academic research, and data-sharing more generally, is in a company’s self-interest. Twitter has encouraged a whole ecosystem that has helped them grow. Many companies have an API for that purpose that happens to work nicely for academic research. In general, open access is preferable in academic settings so that all researchers have access to the same data. Interesting papers using proprietary access to Facebook data are less helpful than those using Twitter.
Julia: Could you tell us a bit about how you processed and organized the data for analysis and how you are working to manage it for the future? Given that reproducibility is such an important concept for science, what ways are you approaching ensuring that your data will be available in the future?
Bryan: This is not my strong suit. But at a high level, the steps are (roughly) “get,” “clean,” “store,” “extract,” “experiment.” The “get” varies with the data source (an API). The “clean” step is just a matter of being careful with special characters and making sure data line up into fields correctly. If the API is sensible, the “clean” is easy. We usually store things in a JSON format that is flexible. This is usually a good format to share data. The “extract” and “experiment” steps depend on what you are interested in. Word counts? Phrase counts? Other? The key is not to jump from “get” to “extract” — storing the data in as raw a form as possible makes things flexible.
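The “store” and “extract” steps Bryan outlines can be sketched as follows. The `text` field and the one-JSON-object-per-line layout are illustrative assumptions, not the lab’s actual schema:

```python
import json
from collections import Counter

def store(records, path):
    """'Store': write each record as one raw JSON object per line,
    keeping the data as close to its original form as possible."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def extract_word_counts(path):
    """'Extract': derive word counts from the raw store. A new question
    (phrases, hashtags, ...) needs only a new extract pass, not a re-fetch."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            counts.update(rec["text"].lower().split())
    return counts

reviews = [{"text": "Great baby octopus"}, {"text": "Great service"}]
store(reviews, "reviews.jsonl")
print(extract_word_counts("reviews.jsonl")["great"])  # → 2
```

Because the store keeps the raw records rather than a pre-computed summary, switching from word counts to phrase counts only means writing a different extract function.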
Julia: What role, or potential role, do you see for the future of libraries, archives and museums in working with the kinds of data you collect? That is, while your data is valuable for other researchers now, things like 700,000 Yelp reviews of restaurants will be invaluable to all kinds of folks studying culture, economics and society 10, 20, 50 and 100 years from now. So, what kind of role do you think cultural heritage institutions could play in the long-term stewardship of this cultural data? Further, what kinds of relationships do you think might be arranged between researchers and libraries, archives, and museums? For instance, would it make sense for a library to collect, preserve, and provide access to something like the Yelp review data you worked with? Or do you think they should be collecting in other ways?
Bryan: This is also a great question and also one for which I do not have a great answer. I do not know a lot about the research in “digital humanities,” but that would be a good place to look. People doing digital text-based research on a long-horizon panel of data should provide some insight into what sorts of questions people ask. Similarly, economic data might provide some hints. Finance, for example, has a strong empirical component that comes from having easy-to-access stock data (the CRSP). The hard part for libraries is figuring out which parts to keep. Sampling Twitter, for example, gets a nice time-series of data but loses the ability to track a group of users or Twitter conversations.
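The sampling trade-off Bryan describes arises because a uniform random sample of a stream can be kept in bounded memory. This reservoir-sampling sketch is a generic illustration of that idea, not how Twitter’s Gardenhose actually works:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # replace an item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

tweets = range(1000)                     # stand-in for a tweet stream
print(len(reservoir_sample(tweets, 10, seed=0)))  # → 10
```

A sample like this gives an unbiased time-series, but, as Bryan notes, any given user’s tweets appear only sporadically, so conversations and per-user histories are lost.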
Julia: Talking about the paper you co-authored that analyzed Yelp reviews, Dan Jurafsky said “when you write a review on the web you’re providing a window into your own psyche – and the vast amount of text on the web means that researchers have millions of pieces of data about people’s mindsets.” What do you think are some of the possibilities and limitations for analyzing social media content?
Bryan: There are many limitations, of course. Twitter and Yelp are not just providing a window into things, they are changing the way the world works. “Big data” is not just about larger sample sizes of draws from a fixed distribution. Things are non-stationary. (In an early paper using Twitter data, we could see the “Oprah” effect as the number of users jumped in the day following her show about Twitter.) Similarly, the data we see in social media is not a representative cross-section of society. But both of these are the sorts of things good modeling – statistical, economic – should, and does, aim to capture. The possibilities of all this new data are exciting. Language is a rich source of data, with challenging models needed to turn it into useful information. More generally, social media is an integral part of many economic and social transactions. Capturing that in a tractable model makes for an interesting research agenda.