We’ve written about the BitCurator project a number of times, but the project has recently entered a new phase and it’s a great time to check in again. The BitCurator Access project began in October 2014 with funding through the Mellon Foundation. BitCurator Access is building on the original BitCurator project to develop open-source software that makes it easier to access disk images created as part of a forensic preservation process.
Kam Woods has been a part of BitCurator from the beginning as its Technical Lead, and he’s currently a Research Scientist in the School of Information and Library Science at the University of North Carolina at Chapel Hill. As part of our Insights Interview series we talked with Woods about the latest efforts to apply digital forensics to digital preservation.
Butch: How did you end up working on the BitCurator project?
Kam: In late 2010, I took a postdoc position in the School of Information and Library Science at UNC, sponsored by Cal Lee and funded by a subcontract from an NSF grant awarded to Simson Garfinkel (then at the Naval Postgraduate School). Over following months I worked extensively with many of the open source digital forensics tools written by Simson and others, and it was immediately clear that there were natural applications to the issues faced by collecting organizations preserving born-digital materials. The postdoc position was only funded for one year, so – in early 2011 – Cal and I (along with eventual Co-PI Matthew Kirschenbaum) began putting together a grant proposal to the Andrew W. Mellon Foundation describing the work that would become the first BitCurator project.
Butch: If people have any understanding at all of digital forensics it’s probably from television or movies, but I suspect the actions you see there are pretty unrealistic. How would you describe digital forensics for the layperson? (And as an aside, what do people on television get “most right” about digital forensics?)
Kam: Digital forensics commonly refers to the process of recovering, analyzing, and reporting on data found on digital devices. The term is rooted in law enforcement and corporate security practices: tools and practices designed to identify items of interest (e.g. deleted files, web search histories, or emails) in a collection of data in order to support a specific position in a civic or criminal court case, to pinpoint a security breach, or to identify other kinds of suspected misconduct.
The goals differ when applying these tools and techniques within archives and data preservation institutions, but there are a lot of parallels in the process: providing an accurate record of chain of custody, documenting provenance, and storing the data in a manner that resists tampering, destruction, or loss. I would direct the interested reader to the excellent and freely available 2010 Council on Library and Information Resources report Digital Forensics and Born-Digital Content in Cultural Heritage Institutions (pdf) for additional detail.
You’ll occasionally see some semblance of a real-world tool or method in TV shows, but the presentation is often pretty bizarre. As far as day-to-day practices go, discussions I’ve had with law enforcement professionals often include phrases like “huge backlogs” and “overextended resources.” Sound familiar to any librarians and archivists?
Butch: Digital forensics has become a hot topic in the digital preservation community, but I suspect that it’s still outside the daily activity of most librarians and archivists. What should librarians and archivists know about digital forensics and how it can support digital preservation?
Kam: One of the things Cal Lee and I emphasize in workshops is the importance of avoiding unintentional or irreversible changes to source media. If someone brings you a device such as a hard disk or USB drive, a hardware write-blocker will ensure that if you plug that device into a modern machine, nothing can be written to it, either by you or some automatic process running on your operating system. Using a write-blocker is a baseline risk-reducing practice for anyone examining data that arrives on writeable media.
Creating a disk image – a sector-by-sector copy of a disk – can support high-quality preservation outcomes in several ways. A disk image retains the entirety of any file system contained within the media, including directory structures and timestamps associated with things like when particular files were created and modified. Retaining a disk image ensures that as your external tools (for example, those used to export files and file system metadata) improve over time, you can revisit a “gold standard” version of the source material to ensure you’re not losing something of value that might be of interest to future historians or researchers.
Disk imaging also mitigates the risk of hardware failure during an assessment. There’s no simple, universal way to know how many additional access events an older disk may withstand until you try to access it. If a hard disk begins to fail while you’re reading it, chances of preserving the data are often higher if you’re in the process of making a sector-by-sector copy in a forensic format with a forensic imaging utility. Forensic disk image formats embed capture metadata and redundancy checks to ensure a robust technical record of how and when that image was captured, and improve survivability over raw images if there is ever damage to your storage system. This can be especially useful if you’re placing a material in long-term offline storage.
There are many situations where it’s not practical, necessary, or appropriate to create a disk image, particularly if you receive a disk that is simply being used as an intermediary for data transfer, or if you’re working with files stored on a remote server or shared drive. Most digital forensics tools that actually analyze the data you’re acquiring (for example, Simson Garfinkel’s bulk extractor, which searches for potentially private and sensitive information and other items of interest) will just as easily process a directory of files as they would a disk image. Being aware of these options can help guide informed processing decisions.
Finally, collecting institutions spend a great deal of time and money assessing, hiring and training professionals to make complex decisions about what to preserve, how to preserve it and how to effectively provide and moderate access in ways that serve the public good. Digital forensics software can reduce the amount of manual triage required when assessing new or unprocessed materials, prioritizing items that are likely to be preservation targets or require additional attention.
Butch: How does BitCurator Access extend the work of the original phases of the BitCurator project?
Kam: One of the development goals for BitCurator Access is to provide archives and libraries with better mechanisms to interact with the contents of complex digital objects such as disk images. We’re developing software that runs as a web service and allows any user with a web browser to easily navigate collections of disk images in many different formats. This includes: providing facilities to examine the contents of the file systems contained within those images; interact with visualizations of file system metadata and organization (including timelines indicating changes to files and folders); and download items of interest. There’s an early version of this tool available .
We’re also working on software to automate the process of redacting potentially private and sensitive information – things like Social Security Numbers, dates of birth, bank account numbers and geolocation data – from these materials based on reports produced by digital forensics tools. Automatic redaction is a complex problem that often requires knowledge of specific file format structures to do correctly. We’re using some existing software libraries to automatically redact where we can, flag items that may require human attention and prepare clear reports describing those actions.
Finally, we’re exploring ways in which we can incorporate emulation tools such as those developed at the University of Freiburg using the Emulation-as-a-Service model.
Butch: I’ve heard archivists and curators express ethical concerns about using digital forensics tools to uncover material that an author may not have wished be made available (such as earlier drafts of written works). Do you have any thoughts on the ethical considerations of using digital forensics tools for digital preservation and/or archival purposes?
Kam: There’s a great DPC Technology Watch report from 2012, Digital Forensics and Preservation (pdf), in which Jeremy Leighton John frames the issue directly: “Curators have always been in a privileged position due to the necessity for institutions to appraise material that is potentially being accepted for long-term preservation and access; and this continues with the essential and judicious use of forensic technologies.”
What constitutes “essential and judicious” is an area of active discussion. It has been noted elsewhere (see the CLIR report I mentioned earlier) that the increased use of tools with these capabilities may necessitate revisiting and refining the language in donor agreements and ethics guidelines.
As a practical aside, the Society of American Archivists Guide to Deeds of Gift includes language alerting donors to concerns regarding deleted content and sensitive information on digital media. Using the Wayback Machine, you can see that this language was added mid-2013, so that provides some context for the impact these discussions are having.
Butch: An area that the National Digital Stewardship Alliance has identified as important for digital preservation is the establishment of testbeds for digital preservation tools and processes. Do you have some insight into how http://digitalcorpora.org/ got established, and how valuable it is for the digital forensics and preservation communities?
Kam: DigitalCorpora.org was originally created by Simson Garfinkel to serve as a home for corpora he and others developed for use in digital forensics education and research. The set of materials on the site has evolved over time, but several of the currently available corpora were captured as part of scripted, simulated real-world scenarios in which researchers and students played out roles involving mock criminal activities using computers, USB drives, cell phones and network devices.
These corpora strike a balance between realism and complexity, allowing students in digital forensics courses to engage with problems similar to those they might encounter in their professional careers while limiting the volume of distractors and irrelevant content. They’re freely distributed, contain no actual confidential or sensitive information, and in certain cases have exercises and solution guides that can be distributed to instructors. There’s a great paper linked in the Bibliography section of that site entitled “Bringing science to digital forensics with standardized forensic corpora” (pdf) that goes into the need for such corpora in much greater detail.
We’ve used disk images from one corpus in particular – the “M57-Patents Scenario” – in courses taken by LIS students at UNC and in workshops run by the Society of American Archivists. They’re useful in illustrating various issues you might run into when working with a hard drive obtained from a donor, and in learning to work with various digital forensics tools. I’ve had talks with several people about the possibility of building a realistic corpus that simulated, say, a set of hard drives obtained from an artist or author. This would be expensive and require significant planning, for reasons that are most clearly described in the paper linked in the previous paragraph.
Butch: What are the next steps the digital preservation community should address when it comes to digital forensics?
Kam: Better workflow modeling, information sharing and standard vocabularies to describe actions taken using digital forensics tools are high up on the list. A number of institutions do currently document and publish workflows that involve digital forensics, but differences in factors like language and resolution make it difficult to compare them meaningfully. It’s important to be able to distinguish those ways in which workflows differ that are inherent to the process, rather than the way in which that process is described.
Improving community-driven resources that document and describe the functions of various digital forensics tools as they relate to preservation practices is another big one. Authors of these tools often provide comprehensive documentation, but they doesn’t necessarily emphasize those uses or features of the tools that are most relevant to collecting institutions. Of course, a really great tool tutorial doesn’t really help someone who doesn’t know about that tool, or isn’t familiar with the language being used to describe what it does, so you can flip this: describing a desired data processing outcome in a way that feels natural to an archivist or librarian, and linking to a tool that solves part or all of the related problem. We have some of this already, scattered around the web; we just need more of it, and better organization.
Finally, a shared resource for robust educational materials that reflect the kinds of activities students graduating from LIS programs may undertake using these tools. This one more or less speaks for itself.