Today’s guest post is from Abbie Grotke, Assistant Head of the Digital Content Management section at the Library of Congress.
Users of the Library of Congress Web Archives may have recently noticed issues when trying to access archived content presented at webarchive.loc.gov. We want to give some background and explanation about the ongoing work that is happening to modernize and improve functionality, and to set the stage for future announcements about planned improvements for access to the Library’s Web Archives. This explanation gets a bit in the weeds, but we wanted to be transparent about where we are right now.
The Web Archiving program celebrated its 23rd year in 2023. In the early days we were piloting web archiving workflows, then shifted to collecting more systematically, steadily growing the collection. As more content of collection development interest to the subject and language experts we work with was published online (for example, websites, PDF documents and reports, and parts of websites), more of those experts got involved and our collections grew even more, which was a happy problem! Our program became a wild success! And as you can imagine, being entirely online, web archiving became excellent remote work as staff were sent home during the pandemic, and the last few years saw even more growth in the amount we were preserving.
With that success, however, came challenges with scale that are affecting access to the collections. The Library of Congress web archives currently are about four petabytes in size, and we continue to grow as the program expands and becomes a more routinized collecting method.
First, let me explain a bit about how we serve up the web archives. They are not quite the same as viewing a single photograph or a multi-page book with a “page-turner” on loc.gov. For the web archives, we create item records with metadata about the web archives and thumbnails that users can see and search alongside other Library collections on the Library’s website. We began rolling out monthly releases in 2020 as content exited our one-year embargo, and we were thrilled to make more content available as soon as we could, without too much human intervention.
The archived web content is stored in WARC files, a special preservation format, and to get the content out and render it in a browser, we must use additional software, sometimes called “replay” or “playback” software. To provide access to the archives, the item records link to our replay software at the URL webarchive.loc.gov. Currently we are using open-source software called OpenWayback, which was developed by the web archiving community through the International Internet Preservation Consortium (IIPC). The Library of Congress is a founding member and active participant in this vibrant and supportive community.
In recent years, new “next generation” tools have been developed to improve access to web archives. These include Pywb, an open-source, Python-based replay system developed by Webrecorder and supported by the web archiving community. The IIPC, seeing a need for support for the potentially complex transition from OpenWayback to Pywb, funded development of a transition guide to help member institutions and others migrate to Pywb. While we weren’t initially ready to make the leap, we began to map out what it would take, and we participated in discussions and contributed some requirements to help make our eventual transition smoother.
I should also note that in order to render in replay tools, web archives also need CDX files (which we’ve talked about in the Signal before). CDX files are index files made up of lines of metadata, in which each line represents a single object within a WARC file. Another big change for us to plan for involves these files: OpenWayback has a flat CDX structure, and we realized that, at our scale and with a transition to Pywb, we needed to replace the flat structure with OutbackCDX to more efficiently store and look up web archive data. That means re-ingesting all of the indexes for the billions of files that make up the four petabytes of web archives. As you might imagine, this will take some time!
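For readers who want to see what “one line per object” means in practice, here is a minimal Python sketch of reading a flat CDX index. It is an illustration only, not the Library’s code and not how OpenWayback or OutbackCDX are implemented. It assumes the common 11-field, space-separated CDX layout (URL key, timestamp, original URL, MIME type, status, digest, redirect, meta tags, length, offset, WARC filename); real indexes can use other field orders. The sample lines, digests, and file names are made up, and `parse_cdx_line` and `lookup` are hypothetical helper names.

```python
from bisect import bisect_left
from typing import List, NamedTuple


class CdxRecord(NamedTuple):
    urlkey: str      # canonicalized, sortable form of the URL
    timestamp: str   # capture time, YYYYMMDDhhmmss
    original: str    # URL as it was crawled
    mimetype: str
    status: str
    digest: str
    length: int      # compressed record length in bytes
    offset: int      # byte offset of the record inside the WARC file
    filename: str    # WARC file that holds the record


def parse_cdx_line(line: str) -> CdxRecord:
    """Parse one line, assuming the 11-field layout described above."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    (urlkey, timestamp, original, mimetype, status, digest,
     _redirect, _meta, length, offset, filename) = fields
    return CdxRecord(urlkey, timestamp, original, mimetype, status,
                     digest, int(length), int(offset), filename)


def lookup(sorted_lines: List[str], urlkey: str) -> List[CdxRecord]:
    """Find every capture of one URL key in a sorted, flat list of lines.

    A flat index is kept sorted, so a binary search finds the first
    matching line and the remaining matches follow it directly.
    """
    prefix = urlkey + " "
    start = bisect_left(sorted_lines, prefix)
    results = []
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break
        results.append(parse_cdx_line(line))
    return results


# Made-up sample data for illustration only.
SAMPLE = sorted([
    "gov,loc)/ 20200101000000 https://www.loc.gov/ text/html 200 "
    "AAAA - - 1234 0 example-0001.warc.gz",
    "gov,loc)/ 20210101000000 https://www.loc.gov/ text/html 200 "
    "BBBB - - 1300 5678 example-0002.warc.gz",
    "gov,loc)/about 20200101000500 https://www.loc.gov/about text/html 200 "
    "CCCC - - 900 1234 example-0001.warc.gz",
])

captures = lookup(SAMPLE, "gov,loc)/")
```

In this sketch, a replay tool would use the `filename`, `offset`, and `length` of a capture to pull a single record out of a WARC file. The sketch holds the whole index in memory, which is fine for three lines but hints at why a flat structure becomes harder to manage with billions of entries.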
So where are we today and where are we headed?
Some work has been done in recent months to stabilize the current OpenWayback replay; however, users may still notice intermittent outages and slow response times. We’ve put our monthly releases on hold for now.
Development work to make the transition to Pywb has been prioritized, and the Library is actively working to make this happen. We still have some setup to do, and the ingest of indexes from 23 years of archiving is anticipated to take a while, but we are excited to announce that we plan to have a beta Pywb replay tool available this fall. At that time, a limited amount of content will be available to view while we work to ingest the remaining indexes.
Despite these growing pains, we are actively crawling and adding to the Library of Congress web archives. Recent new collections in development include a Climate Change Web Archive, a Mass Communications Web Archive, and Voices: Eastern and Central European Americans Web Archive. These are just a few of the 81 active event and thematic collections that continue to collect and preserve web content.
Please watch this space for updates. We will make announcements about the beta and when the full migration is completed next year. Stay tuned!