When Kathleen O’Neill talks about digital collections, she slips effortlessly into the info-tech language that software engineers, librarians, archivists and other information technology professionals use to communicate with each other. O’Neill, a senior archives specialist in the Library of Congress’s Manuscript Division, speaks with authority about topics such as file signatures, hex editors and checksums even though she has a traditional paper-centric Master of Library Science degree. She picked up her technology expertise on the job, through years of rescuing digital content off of erratic computers, troublesome files and unstable storage media.
The Library often acquires a collection at the end of someone’s career, which means that many of the digital files that O’Neill sees in collections may have been created decades ago. And chances are good that some of the storage devices, or the files they contain, will be obsolete and will require the Manuscript Division to process them with special digital forensics resources.
When a collection is first received by the Manuscript Division, a staff member reviews the contents and transfers any digital media devices to the digital collections registrar, O’Neill. Archivists might also find digital storage devices among the paper documents later, when they are processing the collection. In either case, O’Neill records receipt of the materials in a local database. The record includes the collection name, collection number, a registration number and any additional notes about the materials. Said O’Neill, “If I get digital material, I give it a registration ID and that forms the beginning of what will become a unique ID for each piece of media.” This begins the tracking information or what O’Neill calls the “chain of custody.”
Once O’Neill gets the storage devices (which are essentially computer hardware versions of collection boxes or containers) she completes or supervises the following tasks:
- Physical inventory of the storage devices.
- Transfer of the files off the storage devices using the Bagger tool.
- Transfer of the files to the Library’s digital repository for long-term preservation.
Kimberly Owens, senior archives technician, is responsible for physical inventory of the storage devices. This task includes photographing the disk on a standard paper form filled with information about the disk, most importantly the location from which it came. When she or the archivist assigned to the collection finishes adding the metadata, the paper form is placed into the paper collection where the disk was; the disk goes to a special hardware/software storage area. “If you remove the disk without creating proper documentation, you could lose context,” said O’Neill. “When you’re working on hundreds of pieces of media it’s hard to keep track, so it’s nice to have a visual to go back to and make sure that it’s the right one. We count and photograph each piece of media, which can be labor intensive. One of our larger collections had well over six hundred 3.5″ floppies.”
Before any action is taken on a storage device, O’Neill protects the media from being unintentionally altered by write-protecting it. She “locks” 3.5″ floppies by moving the tab on the disk to the write-protect position. For other media, the division has a variety of write-blockers on hand. O’Neill creates a file directory listing for each piece of media. “We do a simple command-line file directory listing so that we can capture a list of the files and the dates they were modified,” said O’Neill. “It’s been really useful early on because I’ve been able to identify different materials that we should or should not have.”
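The article does not say which command O’Neill runs, so the following is only a rough Python sketch of the same idea: walk a mounted, write-protected disk and record each file’s path, size and last-modified date. The function name `list_files` is hypothetical and not part of any Library tool.

```python
import os
from datetime import datetime, timezone


def list_files(root):
    """Return (relative_path, size_in_bytes, modified_utc) for every file under root."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so directories are visited in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            info = os.stat(path)  # reads metadata only; file contents are not touched
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            rows.append((os.path.relpath(path, root), info.st_size, modified.isoformat()))
    return rows
```

Calling `list_files` on the folder where a disk is mounted returns rows that can be saved next to the disk’s registration record. Because the sketch only reads metadata, it is the kind of low-risk first look the listing is meant to provide.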
The next step is to bag the media. O’Neill prepares the files by organizing them into bags using the Library of Congress’s Bagger tool. The bags contain the digital collection material along with self-describing documentation. The structure of a bag includes:
- a directory containing the file or files (data)
- a checksummed “manifest,” a receipt that itemizes the files in the bag
- a “bagit.txt” file that declares, “I am a bag.”
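To make that three-part structure concrete, here is a minimal Python sketch that lays out a bag by hand. It is not the Bagger tool and is not how the Manuscript Division builds its bags; the function names, the choice of SHA-256 and the version string written to “bagit.txt” are assumptions made for illustration, following the published BagIt layout.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Checksum a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data, then write bagit.txt and a manifest."""
    bag = Path(bag_dir)
    data = bag / "data"
    # copytree creates bag_dir and data/ and keeps file timestamps;
    # it raises an error if the data directory already exists.
    shutil.copytree(Path(source_dir), data)

    # The declaration file: the part that says "I am a bag."
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # The manifest: one line per payload file, checksum first, then its path.
    lines = []
    for path in sorted(p for p in data.rglob("*") if p.is_file()):
        lines.append(f"{sha256_of(path)}  {path.relative_to(bag).as_posix()}\n")
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag
```

The resulting folder holds a `data` directory, a `manifest-sha256.txt` receipt and the `bagit.txt` declaration, which are the three pieces listed above.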
Often, this step does not go simply or smoothly.
Old disks can be unpredictable and inconsistent; just trying to view their contents can be a challenge. O’Neill might pop an old floppy disk into one computer and nothing happens; the computer might not recognize the presence of the disk. Or worse, the computer might ask if she wants to format the disk (and erase its contents). If she pops the disk into another computer of the same year, make and model, she might easily see the disk and view its contents. Or the disk might display onscreen but appear to be empty, when it’s really not.
O’Neill and her colleagues have a number of desktop tools for examining and reporting on disk contents, disk structure, modification dates and so on.
If they need more sophisticated tools for problematic disks, they can take the disks to the Library’s Preservation Digital Reformatting Program and use the Forensic Recovery of Evidence Device, or FRED (a high-end digital-forensics workstation), to analyze and recover files off the disks.
The FRED can read and access files off a variety of disparate storage media without accidentally damaging or erasing the contents. Loaded onto FRED is the Forensic Toolkit, or FTK, which is software for diagnosing disks and files, searching and sorting and restoring “deleted” files. FTK can identify and create reports about files’ properties and formats, declare which OS and software (and their versions) created the files and what applications will read them. FTK Imager can create a disk image (an exact copy of the content, exactly as it is on the original storage device, including data and structure information) so users can work with the copy to avoid the risk of damaging the original.
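The idea behind a disk image can be sketched in a few lines of Python. This is not FTK Imager and makes no attempt to handle bad sectors, which real forensic imagers do; it simply copies a source byte for byte and checksums the bytes as they are read. The function name `image_device` is hypothetical, and reading a raw device (rather than an ordinary file) would normally require a write-blocker and elevated permissions.

```python
import hashlib


def image_device(source_path, image_path, block_size=512 * 1024):
    """Copy source_path to image_path byte for byte; return (byte_count, sha256)."""
    digest = hashlib.sha256()
    total = 0
    # "rb" opens the source read-only; "xb" refuses to overwrite an existing image.
    with open(source_path, "rb") as source, open(image_path, "xb") as image:
        while True:
            block = source.read(block_size)
            if not block:
                break
            image.write(block)
            digest.update(block)
            total += len(block)
    return total, digest.hexdigest()
```

Recording the returned checksum alongside the image gives later users a way to confirm that the working copy still matches what was read from the original media.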
Another resource available to the Manuscript Division staff is BitCurator, which we profiled in the Signal interview with Cal Lee and the story about the Maryland Institute for Technology in the Humanities. Some of BitCurator’s hardware and software functions are modestly comparable to the FRED’s. But BitCurator was designed specifically to help digital archivists manage information, especially sensitive personal information that may be contained within the collections.
O’Neill said that the Manuscript Division has not yet had a major issue with a CD-ROM or DVD, but if one arises, staff can consult with the experts in the Library’s Preservation, Research and Testing Division to analyze it.
Through their thorough and detailed records, O’Neill and her colleagues have been able to search for patterns among certain storage media. For example, they discovered that old double-sided, high-density disks can be difficult to access with modern equipment. “But, then again, sometimes it could just be a bad set of disks,” said O’Neill. “I’ve hit a whole run where double-sided, high-density disks don’t work. I can’t get them to read. Sometimes I’m able to recover them quickly; sometimes not. Sometimes the third time works. It’s trial and error. But you have to balance that with what’s on the disk and whether or not it’s worth the extra levels of work.”
After O’Neill accesses and catalogs the contents of a disk, she copies the files off the disks and into the Library of Congress repository. For this purpose, she uses the Library’s Content Transfer System, which enables staff to describe, inventory and transfer files from local media to the repository.
The Content Transfer System allows staff to validate the integrity of the files and the completeness of the bag by checking them against the manifest to confirm that nothing has changed. “We try to generate a checksum at the earliest point in the process,” said O’Neill, “so that we have that checksum carried through as the file moves through our system. It’s a way to document the authenticity of the file.”
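The Content Transfer System’s internals are not described here, so the following Python sketch only illustrates the general technique of checking a bag against its manifest. It assumes a manifest named `manifest-sha256.txt` with one “checksum, then path” entry per line; the function name `verify_bag` and the shape of its report are hypothetical.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir, algorithm="sha256"):
    """Compare the files under data/ with the manifest and report any differences."""
    bag = Path(bag_dir)

    # Read the expected checksums: each line is "<checksum>  <relative path>".
    expected = {}
    manifest = bag / f"manifest-{algorithm}.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, rel_path = line.split(maxsplit=1)
            expected[rel_path] = checksum.lower()

    actual = {p.relative_to(bag).as_posix() for p in (bag / "data").rglob("*") if p.is_file()}

    report = {"missing": [], "changed": [], "unlisted": sorted(actual - set(expected))}
    for rel_path, checksum in sorted(expected.items()):
        if rel_path not in actual:
            report["missing"].append(rel_path)  # completeness problem
        # read_bytes loads the whole file, which is fine for floppy-sized content
        elif hashlib.new(algorithm, (bag / rel_path).read_bytes()).hexdigest() != checksum:
            report["changed"].append(rel_path)  # integrity problem
    return report
```

An empty report means the bag is both complete (every listed file is present, and nothing extra has appeared) and intact (every checksum still matches the one recorded when the bag was made).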
O’Neill copies the bagged master files to the Library of Congress repository, which replicates them locally and at a geographically remote location. The Content Transfer System also scans for viruses when it ingests the file. The Content Transfer System tracks the user login information and all the metadata associated with the files, so between the Manuscript Division’s local database and the Content Transfer System, there is a continuous record associated with a given file from the time the first staff member appraised the collection.
Finally, O’Neill takes the original digital hardware and software and shelves it for preservation, in case someone ever needs to access it again.
In March 2015, I asked O’Neill for an inventory of the storage media currently in the Library’s Manuscript Division collections, and she came up with this list:
930 – 3.5″ floppies
250 – 5.25″ floppies
145 – Optical media (CDs, DVDs, CD-Rs, etc.)
65 – 8″ floppies
35 – Zip drives
30 – Computer tapes
3 – CPUs
3 – Bernoulli disks
4 – Flash drives
3 – External hard drives
Of course that list is continually expanding and, in fact, it should increase exponentially in the near future as the Library acquires collections from donors who created the bulk of their works digitally. The Manuscript Division has completed ingest to long-term storage on approximately 500 of these media. Kimberly Owens is working to inventory and bag the remaining backlog.
Researchers visiting the Library of Congress can access copies of some of the digital collections but access depends on copyright and the conditions set by the collection donor. There are also technological challenges to serving up records. While the Division is scheduled for some infrastructure upgrades in the next several months, in the meantime the reading room terminals are connected to the Library’s network and not to a hard drive that is loaded with software that could open, say, graphics or documents. “That means that, depending on the file format, the researcher may or may not be able to read the file,” said O’Neill. “It would have to be something that was renderable in a browser. And if it’s renderable in a browser, somebody could copy it and email it to themselves. So there are security issues that we are trying to work through.” Access is currently available only onsite in the Manuscript Reading Room.
Viewing some files will continue to be a technological challenge, especially the old or obscure file types. “The earliest digital material that we’ve seen is from 1987,” said O’Neill. “That would be Word Perfect 4.1 or something or a really old Apple file format. All of that is tricky because we don’t have the software to read everything. No one at the Library has the software or drives to read every file format.”
So the volume of digital files that the Manuscript Division, and other divisions around the Library, archive far exceeds the ability of the staff to test each one to see whether it displays or not. For now, a checksum is the best automated option for monitoring file integrity: software can check the inventory at lightning speed to see whether a file has changed. Display of the file’s contents is a different matter.
More than likely, instances of restoration will happen as researchers come to the Library to experience files in their original environments, on old computers with old operating systems. Maybe there will be an emulation solution and we won’t need original hardware and software. That would require a whole other set of digital forensics tools, many of which the Manuscript Division already has.
Not all researchers will require a perfect rendering of the original file though, said O’Neill. “I think there will be two very different levels of interest from researchers,” she said. “There are the high-powered, technically savvy digital humanities people who seem to be driving a lot of the conversation in this area. But I think quite a lot of people are just interested in the information. They don’t care what the file format is. They want the information.”
At the very least, the Manuscript Division has an efficient end-to-end curatorial system in place, one that they continue to refine. “We tried to understand and perfect the ingest portion of it, so that we knew things were saved and safe and inventoried,” said O’Neill. “Access and appraisal of digital collections are ongoing issues with the archivists. And with every collection that comes in, there’s always some weird quirk that makes us re-think everything.”