This post is the latest in our NDSA Innovation Working Group’s ongoing Insights Interview series. Chelcie Rowell (Digital Initiatives Librarian, Wake Forest University) interviews Richard Ball (Associate Professor of Economics, Haverford College) and Norm Medeiros (Associate Librarian, Haverford Libraries) about Teaching Integrity in Empirical Research, or Project TIER.
Chelcie: Can you briefly describe Teaching Integrity in Empirical Research, or Project TIER, and its purpose?
Richard: For close to a decade, we have been teaching our students how to assemble comprehensive documentation of the data management and analysis they do in the course of writing an original empirical research paper. Project TIER is an effort to reach out to instructors of undergraduate and graduate statistical methods classes in all the social sciences to share with them lessons we have learned from this experience.
When Norm and I started this work, our goal was simply to help our students learn to do good empirical research; we had no idea it would turn into a “project.” Over a number of years of teaching an introductory statistics class in which students collaborated in small groups to write original research papers, we discovered that it was very useful to have students not only turn in a final printed paper reporting their analysis and results, but also submit documentation of exactly what they did with their data to obtain those results.
We gradually developed detailed instructions describing all the components that should be included in the documentation and how they should be formatted and organized. We now refer to these instructions as the TIER documentation protocol. The protocol specifies a set of electronic files (including data, computer code and supporting information) that would be sufficient to allow an independent researcher to reproduce–easily and exactly–all the statistical results reported in the paper. The protocol is and will probably always be an evolving work in progress, but after several years of trial and error, we have developed a set of instructions that our students are able to follow with a high rate of success.
Even for students who do not go on to professional research careers, the exercise of carefully documenting the work they do with their data has important pedagogical benefits. When students know from the outset that they will be required to turn in documentation showing how they arrive at the results they report in their papers, they approach their projects in a much more organized way and keep much better track of their work at every phase of the research. Their understanding of what they are doing is therefore substantially enhanced, and I in turn am able to offer much more effective guidance when they come to me for help.
Despite these benefits, methods of responsible research documentation are virtually, if not entirely, absent from the curricula of all the social sciences. Through Project TIER, we are engaging in a variety of activities that we hope will help change that situation. The major events of the last year were two faculty development workshops that we conducted on the Haverford campus. A total of 20 social science faculty and research librarians from institutions around the US attended these workshops, at which we described our experiences teaching our students good research documentation practices, explained the nuts and bolts of the TIER documentation protocol, and discussed with workshop participants the ways in which they might integrate the protocol into their teaching and research supervision. We have also been spreading the word about Project TIER by speaking at conferences and workshops around the country, and by writing articles for publications that we hope will attract the attention of social science faculty who might be interested in joining this effort.
We are encouraged that faculty at a number of institutions are already drawing on Project TIER and teaching their students and research advisees responsible methods of documenting their empirical research. Our ultimate goal is to see a day when the idea of a student turning in an empirical research paper without documentation of the underlying data management and analysis is considered as aberrant as the idea of a student turning in a research paper for a history class without footnotes or a reference list.
Chelcie: How did TIER and your 10-year collaboration (so far!) get started?
Norm: When I came to the Haverford Libraries in 2000, I was assigned responsibility for the Economics Department. Soon thereafter I began providing assistance to Richard’s introductory statistics students, both in locating relevant literature and in acquiring data for statistical analysis. I provided similar, albeit more specialized, assistance to seniors in the context of their theses. Richard invited me to his classes and advised students to make appointments with me. Through regular communication, I came to understand the outcomes he sought from his students’ research assignments, and tailored my approach to meet these expectations. A strong working relationship ensued.
Meanwhile, in 2006 the Haverford Libraries in conjunction with Bryn Mawr and Swarthmore Colleges implemented DSpace, the widely deployed open-source repository system. The primary collection Haverford migrated into DSpace was its senior thesis archive, which had existed for the previous five years in a less robust system. Based on the experience I had accrued to that point working with Richard and his students, I thought it would be helpful to future generations of students if empirical theses coexisted with the data from which the results were generated.
The DSpace platform provided a means of storing such digital objects and making them available to the public. I mentioned this idea to Richard, who suggested that not only should we post the data, but also all the documentation (the computer command files, data files and supporting information) specified by our documentation protocol. We didn’t know it at the time, but the seeds of Project TIER were planted then. The first thesis with complete documentation was archived on DSpace in 2007, and several more have been added every year since then.
Chelcie: You call TIER a “soup-to-nuts protocol for documenting data management and analysis.” Can you walk us through the main steps of that protocol?
Richard: The term “soup-to-nuts” refers to the fact that the TIER protocol entails documenting every step of data management and analysis, from the very beginning to the very end of a research project. In economics, the very beginning of the empirical work is typically the point at which the author first obtains the data to be used in the study, either from an existing source such as a data archive, or by conducting a survey or experiment; the very end is the point at which the final paper reporting the results of the study is made public.
The TIER protocol specifies that the documentation should contain the original data files the author obtained at the very beginning of the study, as well as computer code that executes all the processing of the data necessary to prepare them for analysis–including, for example, combining files, creating new variables, and dropping cases or observations–and finally generating the results reported in the paper. The protocol also specifies several kinds of additional information that should be included in the documentation, such as metadata for the original data files, a data appendix that serves as a codebook for the processed data used in the analysis and a read-me file that serves as a users’ guide to everything included in the documentation.
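To make the flow from original data to reported results concrete, here is a minimal sketch in Python of what a pair of command files might do. It is an illustration only, not part of the TIER instructions: the tiny datasets, the variable names and the function names are all hypothetical, and the "files" are held in memory so that the example runs on its own.

```python
import csv
import io
from statistics import mean

# Hypothetical original data, standing in for files obtained at the start of a study.
ORIGINAL_INCOME = """id,income
1,42000
2,
3,58000
4,61000
"""
ORIGINAL_SCHOOLING = """id,years_school
1,12
2,16
3,16
4,18
"""


def read_table(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def build_analysis_data(income_rows, schooling_rows):
    """Processing step: combine files, drop incomplete cases, create a new variable."""
    schooling_by_id = {row["id"]: row for row in schooling_rows}
    analysis = []
    for row in income_rows:
        match = schooling_by_id.get(row["id"])
        if match is None or row["income"] == "":
            continue  # drop observations that cannot be used in the analysis
        years = int(match["years_school"])
        analysis.append({
            "id": row["id"],
            "income": float(row["income"]),
            "years_school": years,
            "college": years >= 16,  # new variable derived from the original data
        })
    return analysis


def run_analysis(analysis_rows):
    """Analysis step: generate the results that would be reported in the paper."""
    college = [r["income"] for r in analysis_rows if r["college"]]
    no_college = [r["income"] for r in analysis_rows if not r["college"]]
    return {
        "n": len(analysis_rows),
        "mean_income_college": mean(college),
        "mean_income_no_college": mean(no_college),
    }


if __name__ == "__main__":
    data = build_analysis_data(read_table(ORIGINAL_INCOME),
                               read_table(ORIGINAL_SCHOOLING))
    print(run_analysis(data))
```

Because every decision (which cases are dropped, how the new variable is defined) lives in code rather than in manual edits, anyone holding the original data can rerun both steps and arrive at the same numbers.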
This “soup-to-nuts” standard contrasts sharply with the policies of academic journals in economics and other social sciences. Some of these journals require authors of empirical papers to submit documentation along with their manuscripts, but the typical policy requires only the processed data file used in the analysis and the computer code that uses this processed data to generate the results. These policies do not require authors to include copies of the original data files or the computer code that processes the original data to prepare them for analysis. In our view, this standard, sometimes called “partial replicability,” is insufficient. Even in the simplest cases, construction of the processed dataset used in the analysis involves many decisions, and documentation that allows only partial replication provides no record of the decisions that were made.
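The distinction between the two standards can be expressed as a small check on what a documentation package contains. The sketch below is a simplified illustration; the component names and the three labels are hypothetical and are not taken from the TIER protocol or from any journal's policy.

```python
# Hypothetical component names, chosen for this illustration only.
ANALYSIS_STAGE = {"processed_data", "analysis_code"}
PROCESSING_STAGE = {"original_data", "processing_code"}


def replication_level(components):
    """Classify a documentation package by how much of the work it lets a reader redo."""
    present = set(components)
    if not ANALYSIS_STAGE <= present:
        # Without the processed data and the analysis code, no results can be rerun.
        return "not replicable"
    if PROCESSING_STAGE <= present:
        # Original data plus processing code: every step can be rerun.
        return "complete"
    # Only the final analysis can be rerun; data-preparation decisions are hidden.
    return "partial"


if __name__ == "__main__":
    journal_style = ["processed_data", "analysis_code"]
    soup_to_nuts = journal_style + ["original_data", "processing_code"]
    print(replication_level(journal_style))
    print(replication_level(soup_to_nuts))
```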
Complete instructions for the TIER protocol are available online. The instructions are presented in a series of web pages, and they are also available for download in a single .pdf document.
Chelcie: You’ve taught the TIER protocol in two main curricular contexts: introductory statistics courses and empirical senior thesis projects. What is similar or different about teaching TIER in these two contexts?
Richard: The main difference is that in the statistics courses students do their research projects in groups made up of 3-5 members. It is always a challenge for students to coordinate work they do in groups, and the challenge is especially great when the work involves managing several datasets and composing several computer command files. Fortunately, there are some web-based platforms that can facilitate cooperation among students working on this kind of project. We have found two platforms to be particularly useful: Dataverse, hosted by the Harvard Institute for Quantitative Social Science, and the Open Science Framework, hosted by the Center for Open Science.
Another difference is that when seniors write their theses, they have already had the experience of using the protocol to document the group project they worked on in their introductory statistics class. Thanks to that experience, senior theses tend to go very smoothly.
Chelcie: Can you elaborate a little bit about the Haverford Dataverse you’ve implemented for depositing the data underlying senior theses?
Norm: In 2013 Richard and I were awarded a Sloan/ICPSR challenge grant with which to promote Project TIER and solicit participants. As we considered this initiative, it was clear to us that a platform for hosting files would be needed both locally, for instructors who perhaps didn’t have a repository system in place, and for fostering cross-institutional collaboration, whereby students learning the protocol in one participating institution could run replications against finished projects at another institution.
We imagined such a platform would need an interactive component, such that one could comment on the exactness of the replication. DSpace is a strong platform in many ways, but it is not designed for these purposes, so Richard and I began investigating available options. We came across Dataverse, which has many of the features we desired. Although we have uploaded some senior theses as examples of the protocol’s application, it was really the introductory classes for which we sought to leverage Dataverse. Our Project TIER Dataverse is available online.
In fall 2013, we experimented with using Dataverse directly with students. We sought to leverage the platform as a means of facilitating file management and communication among the various groups. We built Dataverses for each of the six groups in Richard’s introductory statistics course. We configured templates that helped students understand where to load their data and associated files. The process of building these Dataverses was time-consuming, and at points we needed to jury-rig the system to meet our needs. Although Dataverse is a robust system, we found its interface too complex for our needs. This fall we plan to use the Open Science Framework system to see if it can serve our students slightly better. Down the road, we can envision complementary roles for Dataverse and OSF as they relate to Project TIER.
Chelcie: After learning the TIER protocol, do students’ perceptions of the value of data management change?
Richard: Students’ perceptions change dramatically. I see this every semester. For the first few weeks, students have to do a few things to prepare to do what is required by the protocol, like setting up a template of folders in which to store the documentation as they work on the project throughout the semester, and establishing a system that allows all the students in the group to access and work on the files in those folders. There are always a few wrinkles to work out, and sometimes there is a bit of grumbling, but as soon as students start working seriously with their data they see how useful it was to do that up-front preparation. They realize quickly that organizing their work as prescribed by the protocol increases their efficiency dramatically, and by the end of the semester they are totally sold–they can’t imagine doing it any other way.
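The up-front preparation described above, setting up a template of folders before any data work begins, can itself be scripted. The following Python sketch shows one way to do it; the folder names are hypothetical placeholders and do not reproduce the layout specified in the TIER instructions.

```python
import tempfile
from pathlib import Path

# Hypothetical folder names; the protocol's actual layout is not reproduced here.
TEMPLATE_FOLDERS = [
    "original_data",
    "original_data/metadata",
    "command_files",
    "analysis_data",
    "documents",
]


def create_template(project_root):
    """Create the empty folder template and a placeholder read-me file.

    Returns the sorted list of folders under project_root, relative to it.
    Safe to run more than once: existing folders and files are left alone.
    """
    root = Path(project_root)
    for folder in TEMPLATE_FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "documents" / "README.txt"
    if not readme.exists():
        readme.write_text("Describe every file in this documentation here.\n")
    return sorted(p.relative_to(root).as_posix()
                  for p in root.rglob("*") if p.is_dir())


if __name__ == "__main__":
    # A temporary directory keeps the demonstration from touching real files.
    with tempfile.TemporaryDirectory() as tmp:
        print(create_template(Path(tmp) / "group_project"))
```

Running the same script for every group gives all students an identical starting structure, which makes it easier to see where each data file and command file belongs.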
Chelcie: Have you experienced any tensions between developing step-by-step documentation for a particular workflow and technology stack versus developing more generic documentation?
Richard: The issue of whether the TIER protocol should be written in generic terms or tailored to a particular platform and/or a particular kind of software is an important one, but for the most part has not been the source of any tensions. All of the students in our introductory statistics class and most of our senior thesis advisees use Stata, on either a Windows or Mac operating system. The earliest versions of the protocol were therefore written particularly for Stata users, which meant, for example, we used the term “do-file” instead of “command file,” and instead of saying something like “a data file saved in the proprietary format of the software you are using” we would say “a data file saved in Stata’s .dta format.”
But fundamentally there is nothing Stata-specific about the protocol. Everything that we teach students to do using Stata works just fine with any of the other major statistical packages, like SPSS, R and SAS. So we are working on two ways of making it as easy as possible for users of different software to learn and teach the protocol. First, we have written a completely software-neutral version. And second, with the help of colleagues with expertise in other kinds of software, we are developing versions for R and SPSS, and we hope to create a SAS version soon. We will make all these versions available on the Project TIER website as they become available.
The one program we have come across for which the TIER protocol is not well suited is Microsoft Excel. The problem is that Excel is an exclusively interactive program; it is difficult or impossible to write an editable program that executes a sequence of commands. Executable command files are the heart and soul of the TIER protocol; they are the tool that makes it possible literally to replicate statistical results. So Excel cannot be the principal program used for a project for which the TIER documentation protocol is being followed.
Chelcie: What have you found to be the biggest takeaways from your experience introducing a data management protocol to undergraduates?
Richard: In the response to the first question in this interview, I described some of the tangible pedagogical benefits of teaching students to document their empirical research carefully. But there is a broader benefit that I believe is more fundamental. Requiring students to document the statistical results they present in their papers reinforces the idea that whenever they want to claim something is true or advocate a position, they have an intellectual responsibility to be able to substantiate and justify all the steps of the argument that led them to their conclusion. I believe this idea should underlie almost every aspect of an undergraduate education, and Project TIER helps students internalize it.
Chelcie: Thanks to funding from the Sloan Foundation and ICPSR at the University of Michigan, you’ve hosted a series of workshops focused on teaching good practices in documenting data management and analysis. What have you learned from “training the trainers”?
Richard: Our experience with faculty from other institutions has reinforced our belief that the time is right for initiatives that, like Project TIER, aim to increase the quality and credibility of empirical research in the social sciences. Instructors frequently tell us that they have thought for a long time that they really ought to include something about documentation and replicability in their statistics classes, but never got around to figuring out just how to do that. We hope that our efforts on Project TIER, by providing a protocol that can be adopted as-is or modified for use in particular circumstances, will make it easier for others to begin teaching these skills to their students.
We have also been reminded of the fact that faculty everywhere face many competing demands on their time and attention, and that promoting the TIER protocol will be hard if it is perceived to be difficult or time-consuming for either faculty or students. In our experience, the net costs of adopting the protocol, in terms of time and attention, are small: the protocol complements and facilitates many aspects of a statistics class, and the resulting efficiencies largely offset the start-up costs. But it is not enough for us to believe this: we need to formulate and present the protocol in such a way that potential adopters can see this for themselves. So as we continue to tinker with and revise the protocol, we try to be vigilant about keeping it simple and easy.
Chelcie: What do you think performing data management outreach to undergraduates, or more specifically TIER as a project, will contribute to the broader context of data management outreach?
Richard: Project TIER is one of a growing number of efforts that are bubbling up in several fields that share the broad goal of enhancing the transparency and credibility of research in the social sciences. In sociology, Scott Long of Indiana University is a leader in the development of best practices in responsible data management and documentation. The Center for Open Science, led by psychologists Brian Nosek and Jeffrey Spies of the University of Virginia, is developing a web-based platform to facilitate pre-registration of experiments as well as replication studies. And economist Ted Miguel at UC Berkeley has launched the Berkeley Initiative for Transparency in the Social Sciences (BITSS), which is focusing its efforts to strengthen professional norms of research transparency by reaching out to early career social scientists. The Inter-university Consortium for Political and Social Research (ICPSR), which for over 50 years has served as a preeminent archive for social science research data, is also making important contributions to responsible data stewardship and research credibility. The efforts of all these groups and individuals are highly complementary, and many fruitful collaborations and interactions are underway among them. Each has a unique focus, but all are committed to the common goal of improving norms and practices with respect to transparency and credibility in social science research.
These bottom-up efforts also align well with several federal initiatives. Since 2011, the NSF has required all proposals to include a “data management plan” outlining procedures that will be followed to support the dissemination and sharing of research results. Similarly, the NIH requires all investigator-initiated applications with direct costs greater than $500,000 in any single year to address data sharing in the application. More recently, in 2013 the White House Office of Science and Technology Policy issued a policy memorandum titled “Increasing Access to the Results of Federally Funded Scientific Research,” directing all federal agencies with more than $100 million in research and development expenditures to establish guidelines for the sharing of data from federally funded research.
Like Project TIER, many of these initiatives have been launched just within the past year or two. It is not clear why so many related efforts have popped up independently at about the same time, but it appears that momentum is building that could lead to substantial changes in the conduct of social science research.
Chelcie: Do you think the challenges and problems of data management outreach to students will be different in 5 years or 10 years?
Richard: As technology changes, best practices in all aspects of data stewardship, including the procedures specified by the TIER protocol, will necessarily change as well. But the principles underlying the protocol–replicability, transparency, integrity–will remain the same. So we expect the methods of implementing Project TIER will continually be evolving, but the aim will always be to serve those principles.
Chelcie: Based on your work with TIER, what kinds of challenges would you like for the digital preservation and stewardship community to grapple with?
Norm: We’re glad to know that research data are specifically identified in the National Agenda for Digital Stewardship. There is an ever-growing array of non-profit and commercial data repositories for the storage and provision of research data; ensuring the long-term availability of these is critical. Although our protocol relies on a platform for file storage, Project TIER is focused on teaching techniques that promote transparency of empirical work, rather than on digital object management per se. This said, we’d ask that the NDSA partners consider the importance of accommodating supplemental files, such as statistical code, within their repositories, as these are necessary for the computational reproducibility advocated by the TIER protocol. We are encouraged by and grateful to the Library of Congress and other forward-looking institutions for advancing this ambitious Agenda.