Government data is at risk, but that is nothing new.
The existence of Data.gov, the Federal Open Data Policy, and open government data belies the fact that, historically, a vast amount of government data and digital information has been at risk of disappearing in the transition between presidential administrations. For example, between 2008 and 2012, over 80 percent of the PDFs hosted on .gov domains disappeared. To track these and other changes, California Digital Library (CDL) joined with the University of North Texas, the Library of Congress, the Internet Archive, and the U.S. Government Publishing Office to create the End of Term (EOT) Archive. After archiving the web presence of federal agencies in 2008 and 2012, the team initiated a new crawl in September 2016.
In light of recent events, tools and infrastructure initially developed for EOT and other projects have been taken up by efforts to back up “at risk” datasets, including those related to the environment, climate change, and social justice. Data Refuge, coordinated by the Penn Program in Environmental Humanities (PPEH), has organized a series of “Data Rescue” events across the country where volunteers nominate webpages for submission to the End of Term Archive and harvest “uncrawlable” data to be bagged and submitted to an open data archive. Efforts such as the Azimuth Climate Data Backup Project and Climate Mirror do not involve submitting data or information directly to the End of Term Archive, but have similar aims and workflows.
These efforts are great for raising awareness and building backups of key collections. In the background, CDL and the team behind the Dat Project have worked to back up Data.gov itself. The goal is to preserve not only the datasets catalogued by Data.gov but also the associated metadata and organization that make it such a useful location for finding and using government data. As a result of this partnership, for the first time ever, the entire Data.gov metadata catalog of over 2 million datasets will soon be available for bulk download. This will allow the various backup efforts to coordinate and cross-reference their datasets with those on Data.gov. To allow for further coordination and cross-referencing, the Dat team has also begun acquiring the metadata for all the files acquired by Data Refuge, the Azimuth Climate Data Backup Project, and Climate Mirror.
To keep track of all these efforts to preserve government data and information, we’re maintaining the following annotated list. As new efforts emerge or existing efforts broaden or change their focus, we’ll make sure the list is updated. Feel free to send additional info on government data projects to: firstname.lastname@example.org
Get involved: Ongoing Efforts to Preserve Scientific Data or Support Science
Data.gov – The home of the U.S. Government’s open data, much of which is non-biological and non-environmental. Data.gov has a lightweight system for reporting and tracking datasets that aren’t represented and functions as a single point of discovery for federal data. Newly archived data can and should be reported there. CDL and the Dat team are currently working to back up the data catalogued on Data.gov and also the associated metadata.
End of Term – A collaborative project to capture and save U.S. Government websites at the end of presidential administrations. The initial partners in EOT included CDL, the Internet Archive, the Library of Congress, the University of North Texas, and the U.S. Government Publishing Office. Volunteers at many Data Rescue events use the URL nomination and BagIt/Bagger tools developed as part of the EOT project.
Data Refuge – A collaborative effort that aims to back up research-quality copies of federal climate and environmental data, advocate for environmental literacy, and build a consortium of research libraries to scale their tools and practices to make copies of other kinds of federal data. Find a Data Rescue event near you.
Azimuth Climate Data Backup Project – An urgent project to back up U.S. government climate databases. Started by statistician Jan Galkowski and John Baez, a mathematician and science blogger at UC Riverside.
Climate Mirror – A distributed volunteer effort to mirror and back up U.S. federal climate data. This project is currently being led by Data Refuge.
The Environmental Data and Governance Initiative – An international network of academics and non-profits that addresses potential threats to federal environmental and energy policy, and to the scientific research infrastructure built to investigate, inform, and enforce that policy. EDGI has built many of the tools used at Data Rescue events.
March for Science – A celebration of science and a call to support and safeguard the scientific community. The main march in Washington, D.C., and satellite marches around the world are scheduled for April 22 (Earth Day).
314 Action – A nonprofit that intends to leverage the goals and values of the greater science, technology, engineering, and mathematics community to aggressively advocate for science.