The Digital Content Management section has been working on a project to extract and make available sets of files from the Library’s significant web archives holdings. This is another step to explore the web archives and make them more widely accessible and usable. Our aim in creating these sets is to identify reusable, “real world” content in the Library’s digital collections, which we can provide for public access. The outcome of the project will be a series of datasets, each containing 1,000 files of related media types selected from .gov domains. We will announce and explore these datasets here on The Signal, and the data will be made available through LC Labs. Although we invite usage and interest from a wide range of digital enthusiasts, we are particularly hoping to interest practitioners and scholars working on digital preservation education and digital scholarship projects.
Introduction to the File Datasets
The process to create datasets of the various media types follows a common set of parameters. These sets are not intended as exemplars or test sets, but rather as randomly selected samples that represent aspects of a larger, curated corpus. We do, however, hope that they offer a representative cross-section from a limited sampling frame. The first limit was the larger corpus. The Library seeks to collect sites “broadly,” though not comprehensively, from all the branches of the federal government, and so this dataset initiative amplifies one of our collecting emphases. For this dataset, we selected items that were posted on publicly accessible US .gov domains at the time that the Library archived the resources.
We further limited our files by media type, which was identified according to metadata recorded in web harvesting logs. This metadata is requested from the source site when harvested, and it has not been further validated. This information is generally asserted by the provider’s servers and systems, just as it would be provided on the live web. Because the value may or may not be accurate, the sets can also be used to explore how reliable this supplied technical information is. An example of the media type information that might be received is illustrated in the “Response header information,” where the media type application/pdf, the selector for this dataset, is highlighted (see figure).
From the file populations limited by domain and media type, we randomly selected 1,000 items. We plan to release additional sets over the coming year, including various office and data documents, audio, video, and other formats. The PDF set is available here.
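One way to explore the accuracy of the asserted media type is to compare it against the file's own signature. The following is a minimal sketch in Python, not the Library's method: the function names are hypothetical, and the 1,024-byte search window is an assumption based on the tolerance many PDF readers show for leading junk before the header.

```python
from pathlib import Path

# Every PDF is expected to carry this signature near the start of the file.
PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(data: bytes, search_window: int = 1024) -> bool:
    """Return True if the PDF signature appears within the first bytes.

    search_window is an assumption: the signature should open the file,
    but lenient readers accept it a little further in.
    """
    return PDF_SIGNATURE in data[:search_window]


def file_looks_like_pdf(path: Path) -> bool:
    """Read only the start of a file on disk and apply the same check."""
    with path.open("rb") as handle:
        return looks_like_pdf(handle.read(1024))
```

A file that was served as application/pdf but fails this check (an HTML error page, for instance) would be a candidate for closer inspection.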
Each set will be packaged in a consistent structure and derived from comparable data about the web archives. Each set will be delivered as a ZIP file whose contents are structured according to BagIt. The package will include fixity information about the contents, a CSV with metadata, a README, and a subfolder that holds the set of files. The methods for creating these sets are detailed further in each accompanying README file.
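After unzipping a set, the fixity information can be used to confirm that the files arrived intact. The sketch below assumes the usual BagIt layout, in which a file named manifest-<algorithm>.txt lists one checksum and one relative path per line. The function names are hypothetical, and the default of sha256 is an assumption; check which manifest file the bag actually contains.

```python
import hashlib
from pathlib import Path


def parse_manifest(text: str) -> dict[str, str]:
    """Map each payload path in a manifest to its expected checksum."""
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        checksum, path = line.split(maxsplit=1)
        expected[path.strip()] = checksum.lower()
    return expected


def file_digest(path: Path, algorithm: str) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bag(bag_dir: Path, algorithm: str = "sha256") -> list[str]:
    """Return the payload paths that are missing or fail their checksum."""
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    expected = parse_manifest(manifest.read_text(encoding="utf-8"))
    failures = []
    for rel_path, checksum in expected.items():
        target = bag_dir / rel_path
        if not target.is_file() or file_digest(target, algorithm) != checksum:
            failures.append(rel_path)
    return failures
```

An empty list from verify_bag means every listed file matched its recorded checksum. Dedicated BagIt tools perform more complete validation than this sketch, which only checks the payload manifest.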
Understanding the 1,000 PDF Set
The first installment in the series is a set of 1,000 PDF files, randomly selected from .gov domain sites. The set may be described in many ways, but here are a few salient factors about the set’s technical characteristics. Uncompressed, the 1,000 .gov PDFs comprise 827.5 megabytes. The PDFs were harvested during web crawls conducted over two decades, from 1996 to 2017, with significant peaks in 2009 and 2010 (each of those years saw nearly 200 PDFs harvested).
The creation dates extracted from the files suggest that the oldest file in the set was created in 1974 and the most recent in 2017. This illustrates one of the challenges of the metadata. According to the published documentation about the PDF family, we know that PDFs weren’t created until the mid-1990s. Why, then, are there dates from the 1970s and 1980s? When we looked at the 1974 example, we noticed that it was a scan of a memo written in 1974. Someone had entered metadata about the original document. The information is correct in that the source document does date to 1974, but it is misleading because the date is not when the PDF itself was created. Presumably many of the other dates were automatically generated when the file was created.
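Dates like the 1974 example can be flagged automatically. The sketch below assumes the creation dates are available as raw PDF date strings of the form D:YYYYMMDDHHmmSS, in which everything after the year is optional; the dataset's CSV may record them differently. The function names are hypothetical, and the 1993 cutoff (the year PDF 1.0 was released) is an adjustable assumption.

```python
import re
from datetime import datetime
from typing import Optional

# PDF date strings look like D:YYYYMMDDHHmmSS plus an optional time zone.
# Only the year is required; the time zone suffix is ignored here.
PDF_DATE = re.compile(
    r"^(?:D:)?(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string, returning None if it is malformed."""
    match = PDF_DATE.match(value.strip())
    if match is None:
        return None
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    try:
        return datetime(
            parts["year"], parts.get("month", 1), parts.get("day", 1),
            parts.get("hour", 0), parts.get("minute", 0), parts.get("second", 0),
        )
    except ValueError:
        return None  # e.g. month 13 or day 32


def predates_format(value: str, cutoff_year: int = 1993) -> bool:
    """True if the date is earlier than the cutoff (assumed: 1993)."""
    parsed = parse_pdf_date(value)
    return parsed is not None and parsed.year < cutoff_year
```

A date that predates the format is not necessarily an error. As the 1974 memo shows, it may describe the source document and not the PDF file.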
What else can we find out from the dataset’s metadata? The source domains mirror the Library’s collecting approach. Although the source domains range from federal to state government sites, there are notable emphases on domains associated with the US Congress and with the Government Publishing Office (gpo.gov). The Library of Congress archives all web pages for members of the House of Representatives and the Senate, as well as House and Senate Committee websites. The set appears to include PDF files of many versions, from 1.0 to 1.7 (as purported in the extracted metadata), although about half are version 1.4 or 1.5 (together, 534 files).
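The purported version can be cross-checked against the header that each PDF declares for itself. The following is a rough sketch with hypothetical function names. It reads only the header comment (for example, %PDF-1.4), so it will miss cases where a later version is declared inside the document catalog.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# The header comment carries the version the file claims, e.g. %PDF-1.4
HEADER = re.compile(rb"%PDF-(\d\.\d)")


def header_version(data: bytes) -> Optional[str]:
    """Return the version named in the PDF header, or None if absent."""
    match = HEADER.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


def tally_versions(blobs: Iterable[bytes]) -> Counter:
    """Count how many files declare each version; unreadable ones are 'unknown'."""
    return Counter(header_version(blob) or "unknown" for blob in blobs)
```

Passing the first kilobyte of each file in the set to tally_versions would give a distribution to compare with the figures reported in the metadata.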
While most of the documents have one (339) or two (208) pages, document lengths vary widely, and the longest runs to 1,168 pages.
We expect that many other observations could be made about these files, which could reveal further insights about their form and content as government documents. We welcome you to share any insights you find in the comments below.
Using the Sets
We envision many possible uses for this and subsequent sets. For example, those who download the set may be able to use it for testing workflows in the processing of digital content or collections. Likewise, it may be used to investigate various methods of file characterization and analysis for technical metadata. Since the files in these sets will be selected entirely from government entities, they may be of particular interest to government document librarians for testing tools or experimenting with ongoing work to identify and auto-categorize documents. Digital preservation and iSchool educators may be interested in using the sets as examples for describing and processing collections with specific content. And of course, while we have suggested these possible uses, there are no doubt many others that we have not imagined. We encourage you to explore the datasets and the accompanying metadata in your own experiments or work to analyze digital content.
This is the first installment of a series of datasets, which we will be posting about over the next year on The Signal. This project is part of our growing efforts to encourage innovation with the Library’s collections, connect with audiences, and throw open the treasure chest of the Library’s collections. As more sets become available, we invite you to link to them (https://labs.loc.gov/experiments/webarchive-datasets/), download them, and explore them. We hope that they will be of use to the communities that we have already noted, as well as to those that we have not yet considered.