BagIt at the Library of Congress

This 10th anniversary celebration of BagIt is a guest post by Liz Madden, Digital Media Project Coordinator in the Office of the Chief Information Officer’s Platform Services Division.  

The BagIt File Packaging Format hit two milestones in 2018: it celebrated its tenth anniversary, and in October it became an IETF 1.0 specification. The child of a National Digital Information Infrastructure & Preservation Program (NDIIPP)-era collaboration between LC and the California Digital Library, BagIt derives its simplicity and practicality from years of lessons learned from digital content transfer and management in the earliest era of the modern digital library. Named for the concept of “bag it and tag it”, BagIt provides a directory structure and specifies a set of files for transferring and storing content, with a clear delineation between the digital content itself (stored in a subdirectory called “data”) and the metadata describing it, including a manifest file listing each filename with its checksum value. It also allows for optional basic descriptive elements, stored within the bag in a file called bag-info.txt, that provide recipients or custodians of the content with enough information to identify the provenance, contact information, and context for the file delivery or storage package. A common LC BagIt bag looks something like this:

[Figure: sample LC BagIt bag contents]
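For readers who haven’t seen one, here is a sketch of what a minimal bag along those lines looks like; the top-level file names come from the specification, while the payload filenames (the TIFFs) are made up for illustration:

```text
mybag/
    bagit.txt               <- bag declaration: BagIt version and tag file encoding
    bag-info.txt            <- optional metadata: contact, provenance, context
    manifest-md5.txt        <- one line per payload file: checksum and path
    tagmanifest-md5.txt     <- checksums for the tag files themselves
    data/                   <- the payload: the digital content being transferred
        0001.tif
        0002.tif
```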

BagIt helped the Library bridge the gap between the old world, where the only digital content we managed was created through digitization of a small number of select physical collections, and the current era, in which 2018 marked the first time we received more eJournals through Copyright deposit than physical journals. The early adoption of BagIt at the Library of Congress was critical to our success at expanding digital ingest activities to accommodate the increase in the size and number of digital content deliveries in the first two decades of this new millennium. A comparison of the decade before BagIt and the decade since illustrates how far we’ve come here in the digital library world of the Library of Congress, and how one little specification gave us a much-needed standardized framework for building ingest and transfer processes that were gigabyte-scale in the beginning and are petabyte-scale now.

The Early Days

I came to LC in 1997 as part of the National Digital Library Program to help with American Memory, the Library’s flagship digital library on the worldwide web (how people commonly referred to it in those days). People still regularly included both the http and the www when they cited specific websites in conversation. There was no Google to speak of. And when we digitized materials from the collections, we received the scanned images back on CDs with somewhere in the neighborhood of 700 megabytes of capacity. The file naming convention was 8.3 (eight-character filename, three-character extension). We optimized the image sizes for users with dialup connections. Digital collections that exceeded 50 GB in total seemed huge to us, and we FTP’ed all the incoming data through our Windows desktop machines onto our one web server. When that became impractical for the increasingly larger collections because of the number of CDs required for delivery, we set up CD towers that we had to load manually.

During this time we talked amongst ourselves about automated ways to ensure that we had gotten all data off the CDs—intact—but the momentum of managing the throughput and putting the content online by our year-2000 milestone preempted our earliest efforts to develop a process to verify the pre- and post-transfer data. We weren’t overly concerned about loss because we selected and collated all content prior to digitization, managed the delivery of the digital content and then did virtually 100% quality assurance on what was returned. We considered the content to be digital surrogates of unique and special Library collections material, rather than the best or only copy of the content itself. We were not dealing with born-digital content in American Memory, and our goal was to enable anyone with an internet connection to browse and use content from the Library of Congress. We maintained this method of transfer into the new millennium with only slight changes—like adding the CD towers or upgrading from CD to DVD with its larger capacity. Since the same staff controlled all aspects of the digital content from selection to digitization to delivery to storage to presentation, we knew which physical collection each parent directory represented, which physical object each subdirectory represented, and which American Memory presentation they all belonged to.

Change is in the Air

By the early 2000s, digital activities were dispersing and multiplying like tribbles or wet gremlins in front of our eyes. We launched web archiving projects, which yielded content that was both too large and too impractical for CD or DVD transfer, so it arrived on internal hard drives (read: CPUs) that were FedEx’ed to us because that was deemed more reliable than daily FTP. On top of being too big for our familiar transfer media, web archive content file deliveries couldn’t be scoped out ahead of time. There was no way to predict what files could and should be returned from a crawl, and no going back in time to redo a crawl for a given date if we lost a file. So we definitely wanted to make sure we got all the files that were delivered. Then there was the Library’s active engagement in collaborations with other countries, from whom we received already-digitized content from their own collections and made it available on the Library’s website through Meeting of Frontiers and Global Gateways. Additionally, we were increasingly having to transfer and manage content digitized by partner organizations as well as rights-restricted content digitized in-house, which presented new challenges.

Throughout this time the safest and most practical place to upload and manage all of the content, regardless of provenance or intended use, was on the same server used for all things digital. Communicating with all the content custodians about routine server maintenance or storage management often required some investigative work just to identify the correct contact for unfamiliar new content that had appeared on the server, because there was nothing in the content itself to identify it, no common web application access point to contextualize it, and no common system where it was all described and accounted for.

Amidst all that activity the Library received its first major digital acquisition in 2003 in the form of the September 11th Digital Archive from George Mason University (GMU). This was noteworthy as the first transfer of a digital archive to the Library. At about 15 GB in size, it also seemed sufficiently portable to be used as the corpus for the first NDIIPP project, called the Archive Ingest & Handling Test (AIHT). As part of the AIHT we retrieved it from GMU on a hard drive (the first one I’d ever used) and brought it back to the Library for testing. It seemed like magic to me that we could bring back 15 GB on one device when just a few years earlier it would’ve taken a CD tower to load it all. This time we had only one device to—yes—plug into our Windows NT desktop machines and upload to the server.

The illusion of magic evaporated abruptly when we did a file count of what was on the hard drive, compared it to what we had loaded onto the server, and found that the numbers didn’t match. This wasn’t unfamiliar to us, so we just re-did the upload, as we routinely did when counts differed before and after a transfer. But the counts still didn’t match after another couple of tries. That seemed odd to us; we trusted our computers to count better than we could. We asked GMU how many files they’d sent us, and they told us a number that matched neither what we saw on the hard drive plugged into our desktop nor the list on the server.

Then we knew something was up. We looked very carefully at the steps we had taken along the way, and none of it seemed out of the ordinary to us. GMU had copied their data to a hard drive. We’d copied the data off the hard drive and onto our server as we had done countless times before. And yet, we couldn’t reproduce the expected file counts.  We retraced our steps. We tried again and got similar results. This time we looked more carefully at the actual files on the drive and noticed for the first time that some of the directories had spaces in them, and many of the filenames were unlike those we were accustomed to. For example: in_getdata.asp?tpl=eFAZ&mode=MED&id={58F821AB-1D04-4A36-9B60-73AB6CCFF352}  or button3.asp?tagver=3&SiteId=69429&Sid=005-01-4-17-233860-69429&Tz=-500&firstwkday=sunday&Edition=ecommerce&title=NO SCRIPT&url=http:/noscript&javaOK=No&

We learned that GMU had copied the data off their Unix/Linux ext3-formatted file system onto an NTFS drive, which we’d then plugged into a Windows NT desktop and FTP’ed up to our IBM RS/6000 AIX server. The different file systems and operating systems read those file names differently and in some cases split file names into multiple files with partial names truncated from the original. On top of this, there were incompatibilities in tar versions between the operating systems. By transferring the content in this way, we had basically played the data transfer equivalent of Telephone, with each component interpreting the data according to its own file system and file naming rules. There was no way to know for sure that we’d even gotten the correct number of files with the correct names without getting a manifest of all the file names and paths from GMU. It was a whole different ballgame from the discrete, controlled world of the 1990s, where we knew when we sent out the content for digitization how many files we’d get back, and we could at least count the files on a CD and compare that to a count of the transferred files on the server to make sure we’d uploaded it all.

We also watched as the size and volume of digital content began to increase in the early 2000s. We got fewer and fewer deliveries on CD and DVD and more and more on external hard drives.

Still digitizing large quantities of our own collections, we also turned our attention to digital-only/born-digital content transfer through programs like the NDIIPP and the National Digital Newspaper Program (NDNP). While NDIIPP analyzed what might be required to maintain digital content for the long term, sharing the content and the responsibility across multiple organizations and partnerships, NDNP necessitated the transfer of large volumes of data from multiple external organizations to the Library for storage and for access through the Chronicling America website. In a few short years we had sailed out from the small pond of American Memory and were bouncing down the rapids of a millennium that had never known a non-digital world. Content that used to come in on 700 MB CDs a few times a week was now arriving on external hard drives that could hold 20 times that amount and deliver it all at once. We had content coming in through multiple programs across the whole institution, all for different uses and purposes. In short, we had gone from gigabyte-scale to terabyte-scale, and from a world where we had met every file personally and were present at its conception, to a world where in some cases even the senders couldn’t reliably tell us the counts or names of the files we should have received. We even got some more servers to hold it all.

Enter BagIt

If you’ve followed this journey so far, we have reached the point when BagIt was born. Lessons learned from NDIIPP projects such as the AIHT and the Web-at-Risk project with the California Digital Library, combined with the experience of transferring and maintaining digital content within the context of a large digital library, helped inform the BagIt specification. It outlines file naming structures optimized to survive moves across operating systems and file systems; it provides a manifest of all the files included in a delivery, with a checksum value for each; and it allows senders and recipients to include additional provenance information so that anyone viewing the package can immediately identify what it contains and who the contacts for it are. These facets may seem obvious now, but they had not been quantified or qualified in a specification like BagIt before 2008.
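The core verification step the specification enables is simple to state: recompute each payload file’s checksum and compare it to the manifest. The following is a minimal sketch of that idea using only the Python standard library; it is an illustration against an MD5 manifest in the spec’s “checksum path” line format, not the Library’s actual implementation (which today would more likely use SHA-256 or SHA-512):

```python
import hashlib
from pathlib import Path

def verify_bag(bag_dir):
    """Check every entry in a bag's manifest-md5.txt against its payload files.

    Returns a list of problems; an empty list means the payload is intact.
    """
    bag = Path(bag_dir)
    errors = []
    for line in (bag / "manifest-md5.txt").read_text().splitlines():
        # Each manifest line is: <checksum> <relative path under the bag>
        expected, relpath = line.split(maxsplit=1)
        payload = bag / relpath
        if not payload.is_file():
            errors.append(f"missing: {relpath}")
            continue
        actual = hashlib.md5(payload.read_bytes()).hexdigest()
        if actual != expected:
            errors.append(f"checksum mismatch: {relpath}")
    return errors
```

This is exactly the check that was impossible in the GMU transfer story above: without a manifest travelling with the content, there was nothing to recompute against.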

More importantly, BagIt provided a framework that enabled the standardization of toolsets. Developers at LC built code around the BagIt specification and created the BagIt Library (BIL), upon which they constructed an entire workflow system for ingest and transfer. The system works with both BagIt and non-BagIt content (what we call “bagged” and “unbagged”) but it’s optimized for BagIt. It verifies incoming BagIt content to guarantee that the content has retained its integrity from the point of creation through delivery and ingest to storage. For unbagged content, the system creates the manifest from the content itself at the point of ingest, verifies it at each step of ingest the same way it does for bagged content, and bags it on copy to long-term storage.
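The manifest-from-content step for unbagged material can be sketched in the same spirit: walk the payload, hash each file, and emit manifest lines. Again, this is a hedged standard-library illustration of the technique, not the CTS or BagIt Library code:

```python
import hashlib
from pathlib import Path

def make_manifest(data_dir):
    """Walk a payload directory and build manifest lines of the form
    '<md5>  data/<relative path>', one per file, sorted for stable output."""
    root = Path(data_dir)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        rel = path.relative_to(root).as_posix()
        lines.append(f"{digest}  data/{rel}")
    return "\n".join(lines) + "\n"
```

Once content has a manifest like this, every subsequent copy (between servers, to tape, to long-term storage) can be verified the same way as content that arrived already bagged.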

The Copyright historical records card catalog digitization project was the first large-scale project for which we received contractor deliveries in BagIt structure and used the new system, which we now know as CTS, to ingest the more than 25,000 bags of copyright cards between 2010 and 2014. As I type these words in 2019, we have inventoried nearly 700,000 bags with almost half a billion files. Delivery in BagIt structure is now a requirement for digitization contracts at LC. CTS integrates values from both BagIt tags and custom tags in the bag-info.txt file to enhance inventory reporting. We now have the custodian, the file manifest and checksums for all the content stored in the inventory system. We’ve come a long way in the ten years since BagIt.

If this all sounds good to you and you want to use BagIt too, the Library maintains multiple tools on GitHub to help content creators, donors, and service providers package digital content in BagIt structure. The tools range from a simple GUI that can be downloaded and run on a desktop to package up smaller content easily, to more robust tools that can be integrated with production environments to create BagIt bags at a larger scale. As you can tell, I’m a fan. The days of scrambling to figure out whether we received the content as intended, or trying to locate the custodian for a certain set of files, are largely behind us. The system we had dreamed of in the 1990s to check file integrity before and after transfer is real now, which is a relief because nowadays, when a single project has been known to ingest nine million files in a single month, it’s no longer possible to eyeball every single file prior to ingesting it.

So, Happy 1.0, BagIt! I don’t know how we would have moved from gigabytes to petabytes without you!

We stand with virtual boom box outside the window of the authors, contributors and supporters of BagIt from the early days until now: especially Andy Boyko, John Kunze, Justin Littman, John Scancella, Chris Adams, David Brunton, Rosie Storey, Ed Summers, Brian Vargas, Kate Zwaard, the San Diego Super Computer Center, Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Dave Crocker, Scott Fisher, Brad Hards, Erik Hetzner, Keith Johnson, Leslie Johnston, David Loy, Mark Phillips, Tracy Seneca, Stian Soiland-Reyes, Brian Tingle, Adam Turoff, and Jim Tuttle.

A few additional resources: