Scientific data is the biggest of the “big data.” In fact, research data and increased complexity and volume of data are two of the challenges addressed by the National Agenda for Digital Stewardship. To find out more about the data preservation and access challenges at the National Oceanic and Atmospheric Administration, I interviewed George Jungbluth, NOAA’s deputy chief of staff and director of communications.
Mike: Broadly speaking, could you tell us a bit about the kinds of data NOAA collects and why it is important?
George: The National Climatic Data Center collects many types of weather and climate data from the National Weather Service, weather stations, satellite processing systems, radars, weather and climate models, in situ data processing systems and paleoclimate studies. These data are important to inform weather forecasting for the nation, assess drought severity, understand our changing climate, enable fisheries management, promote scientific research and inform decision makers on environmental matters.
Mike: Are you preserving any particularly problematic file types?
George: Our data holdings include many data formats (binary, ASCII, text, BUFR, netCDF, JPG, PDF, etc.). Our preference is for platform-independent, self-describing formats such as Network Common Data Form (netCDF). The most problematic formats are those of older data without proper documentation.
Mike: What are some of the challenges that NOAA faces with data preservation? Is scale a challenge?
George: The volume of data NCDC expects to preserve, store and provide access to is increasing at a rapid pace, which poses a challenge to the rate at which our systems and network bandwidth can scale.
NCDC faces some challenges in acquiring data, including how to securely collect the data from hundreds of providers and how best to interface with large volume/rate providers such as satellite systems and modeling systems, but the greater challenge lies in providing access to the large data volumes.
Mike: What different ways does this data come into NOAA?
George: NCDC acquires data from multiple sources including directly from data producers via documented interfaces (preferred), phone systems, internet transfers and data delivered on physical media.
Mike: There is a significant push to make more government data more broadly accessible. Is this an area that NOAA is doing much work in?
George: Yes. Data accessibility is one of our many-faceted challenges. One challenge is developing and managing both the dataset-level metadata required for search and display and the more in-depth file-level (granule-level) metadata needed for understanding and using the data. Developing a scalable system for hosting and managing the metadata is another. As mentioned earlier, providing quick access to multiple petabytes of data is a further issue.
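The distinction between dataset-level and granule-level metadata can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than NCDC's actual schema or system: the class names, field names and the `search` helper are hypothetical, chosen only to show search-oriented metadata sitting above file-oriented metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GranuleRecord:
    """Hypothetical file-level (granule-level) metadata: what is needed to use one file."""
    file_name: str
    data_format: str   # e.g. "netCDF"
    start_time: str    # ISO 8601 timestamp, by assumption
    end_time: str
    size_bytes: int


@dataclass
class DatasetRecord:
    """Hypothetical dataset-level metadata: what a search or display page needs."""
    dataset_id: str
    title: str
    keywords: List[str]
    granules: List[GranuleRecord] = field(default_factory=list)

    def total_bytes(self) -> int:
        # Sum the sizes of all granules that belong to this dataset.
        return sum(g.size_bytes for g in self.granules)


def search(datasets: List[DatasetRecord], keyword: str) -> List[DatasetRecord]:
    """Return datasets whose keyword list contains the term (case-insensitive)."""
    kw = keyword.lower()
    return [d for d in datasets if kw in (k.lower() for k in d.keywords)]


# Made-up example data, for illustration only.
example = DatasetRecord(
    dataset_id="example-001",
    title="Example daily station observations",
    keywords=["Temperature", "Precipitation"],
    granules=[
        GranuleRecord("obs_day1.nc", "netCDF", "2000-01-01T00:00:00Z", "2000-01-01T23:59:59Z", 1_000),
        GranuleRecord("obs_day2.nc", "netCDF", "2000-01-02T00:00:00Z", "2000-01-02T23:59:59Z", 2_500),
    ],
)
```

In this sketch a search touches only the small dataset-level records, while the granule records are consulted once a user has chosen a dataset and needs to understand individual files.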
Mike: Longitudinal data (continuous readings going back in time) is of critical value for studying topics like climate change. What are some of the challenges that NOAA faces in ensuring long-term access to its data?
George: There are many challenges to providing long-term access to data. The main ones are managing the metadata so that the data remain understandable over time, and managing data formats and access mechanisms.
Mike: What lessons has NOAA learned about data preservation that might be useful for other organizations with similar issues?
George: Preservation planning should begin as early as possible, even before the data are obtained or produced. Establish standards (e.g., metadata, formats), guidelines and processes to support preservation, and provide tools to enable the preservation. Determine what data and information will be preserved for the long term, understand the costs associated with preservation and provide the necessary resources.
Mike: How do you think NOAA’s experience would be useful for other Federal agencies or for other agencies in other countries with similar missions?
George: Sharing our experience of scaling our systems to support data growth and of developing scalable data and metadata management systems would hopefully prove useful to other organizations.
Mike: Are you developing any specific skill sets for long-term data preservation or data analysis?
George: At NCDC, we encourage our staff to explore new technologies, standards and tools. We are also training our staff on Information Science and Technology principles.
Mike: Who are the primary users for NOAA’s data? What kinds of challenges do you face in meeting the needs of your users?
George: The NCDC users are extremely diverse. Users have widely varying expertise, access needs, response times and types of usage. Data uses include many societal benefit areas (insurance, tourism, energy, etc.). Data are used by researchers for understanding and studying weather and climate, by lawyers in cases involving weather conditions, by regulators interested in evaluating energy rates and water usage, and by government and media for monitoring and reporting on recent weather and climate events. The challenge is to provide a set of data-access methods that let users easily find what they need and get access to it.
Mike: What do you see as the biggest data challenges facing NOAA in the next five or ten years? In particular, what kinds of issues do you face in terms of increasing scale of data?
George: In the next five to ten years the volume of satellite data will more than double (from 4 to 10 TB/day) and the volume of climate model data will grow exponentially. Our technological infrastructure (storage, networks) will need to scale and our resources for managing data and metadata will need to increase. But the biggest problem by far will be in allowing the easy search, discovery, interpretation and effective analysis of these data so that the data have the most value to the nation.
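To put the satellite figures quoted above in yearly terms, here is a rough back-of-the-envelope sketch. Only the 4 and 10 TB/day rates come from the interview; the decimal unit conversion (1 PB = 1000 TB), the 365-day year and the assumption of a constant daily rate are simplifications added here, and the function name is hypothetical.

```python
# Assumptions (not from the interview): decimal units and a constant daily rate.
TB_PER_PB = 1000
DAYS_PER_YEAR = 365


def yearly_petabytes(tb_per_day: float) -> float:
    """Convert a constant daily ingest rate in TB/day to PB/year."""
    return tb_per_day * DAYS_PER_YEAR / TB_PER_PB


current_rate = 4.0     # TB/day, as quoted in the interview
projected_rate = 10.0  # TB/day, as quoted in the interview

growth_factor = projected_rate / current_rate
current_yearly = yearly_petabytes(current_rate)
projected_yearly = yearly_petabytes(projected_rate)

print(f"growth factor: {growth_factor:.1f}x")
print(f"current:   {current_yearly:.2f} PB/year")
print(f"projected: {projected_yearly:.2f} PB/year")
```

Under these assumptions, satellite data alone would go from roughly one and a half petabytes per year to more than three and a half, before any climate model output is counted.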