This is a guest post by Stefanie Ramsay.
How do you capture, preserve and make accessible thousands of born-digital documents that state agencies publish to various websites without notification or consistency, and that are often relocated or removed over time? This is the complex task that the State Library of Massachusetts faces in fulfilling its legal digital preservation mandate.
My National Digital Stewardship residency involves conducting a comprehensive assessment of existing state publications, assessing how other state libraries and archives handle this challenge and establishing best practices and procedures for the preservation and accessibility of electronic state publications at the State Library. In this post, I’ll cover how we’ve approached this project, as well as our next steps.
State agencies publish thousands of documents for the public, and a legal mandate requires that they send this content to the library for preservation. Unfortunately, agencies rarely do so, leaving library staff to retrieve content using various other methods. Staff members rely on a homegrown web crawler to capture publications from agency websites, and they also comb through individual agency pages and check social media and the news to spot mentions of agency publications.
Creative as these approaches may be, they do not form a sustainable practice for handling the large amounts of content that agencies produce. Before establishing a better workflow, however, the library needed a better understanding of how much material is published, what kinds of material are published and how best to capture these materials for long-term access and preservation. With this data in hand, we can begin to build an effective digital preservation program.
At the beginning of my residency, we began using web statistics collected from the Massachusetts government’s main portal, Mass.gov. The statistics show which publications users of the site request each month. Having the URL of each publication allows us to see where it is posted and to ascertain the types of documents agencies are publishing.
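To illustrate the kind of tallying this involves, here is a minimal sketch in Python that groups a list of requested URLs by file type and by the first segment of the URL path. It is not the library's actual tooling: the function name, the sample URLs and the idea that the first path segment identifies an agency are all assumptions made for the example.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def summarize_requests(urls):
    """Count requested publications by file extension and by top-level path segment.

    Hypothetical helper: treating the first path segment as the agency or
    site section is an assumption, not a documented Mass.gov convention.
    """
    by_type = Counter()
    by_section = Counter()
    for url in urls:
        path = PurePosixPath(urlparse(url).path)
        # Normalize the extension so "PDF" and "pdf" are counted together.
        ext = path.suffix.lower().lstrip(".") or "(none)"
        by_type[ext] += 1
        # Drop the leading "/" that PurePosixPath reports as its own part.
        parts = [p for p in path.parts if p != "/"]
        section = parts[0] if len(parts) > 1 else "(root)"
        by_section[section] += 1
    return by_type, by_section


# Made-up URLs for illustration only.
sample_urls = [
    "https://www.example.gov/agency-a/docs/annual-report-2015.pdf",
    "https://www.example.gov/agency-a/docs/meeting-minutes.PDF",
    "https://www.example.gov/agency-b/guides/beavers.doc",
]

by_type, by_section = summarize_requests(sample_urls)
print(by_type.most_common())
print(by_section.most_common())
```

A summary like this only shows where documents live and what formats they use; deciding what each document actually is still takes a person looking at it.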
We found a wide range of documents, such as annual reports, meeting materials, Executive Orders, and my personal favorite, a guide to resolving conflict between people and beavers.
After categorizing the content, we needed to narrow down a collection scope. Rather than attempting to capture every publication, I thought it best to define what types of documents are most valuable to the staff and library users and to focus our efforts on those documents. This is not to say that the lower priority items will be ignored; the higher priority items will be handled first, and then we will develop a plan for the rest. To determine which documents were high and low priorities, we implemented a ranking process.
Each staff member ranked the publications for individual agencies on a scale of 1-5 (1 being lowest priority, 5 being highest) on shared spreadsheets, and our collective averages started to filter what was most valuable. Documents such as reports, project documents and topical issues rose to the top, while items such as draft reports, requests for proposals and ephemeral information sank to the low priority tier.
This process formed the basis of our collection policy statement, which serves as a guide when identifying and selecting content for ingestion. This statement is regularly updated as we continue to determine our priorities. We also began collecting metrics on the total number of documents captured by the statistics and the number of documents that fell into each priority tier. These metrics give us a sense of the bigger picture: not only how much content there is, but also how much of it needs to be handled quickly. They also form the basis of an argument for increased resources.
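The ranking and tier counts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cutoffs that separate the tiers, the existence of a middle tier, the function names and all of the sample scores and document counts are invented for the example and are not the library's actual figures.

```python
from collections import Counter
from statistics import mean

# Assumed cutoffs on the 1-5 scale; the real tier boundaries are not
# stated in this post.
HIGH_CUTOFF = 4.0
LOW_CUTOFF = 2.0


def tier_for(average):
    """Map an average staff score to a priority tier (hypothetical rule)."""
    if average >= HIGH_CUTOFF:
        return "high"
    if average <= LOW_CUTOFF:
        return "low"
    return "medium"


def rank_document_types(scores_by_type):
    """Average each document type's staff scores and assign a tier."""
    ranked = {}
    for doc_type, scores in scores_by_type.items():
        average = mean(scores)
        ranked[doc_type] = (round(average, 2), tier_for(average))
    return ranked


def count_by_tier(ranked, doc_counts):
    """Total the number of documents that fall into each tier."""
    totals = Counter()
    for doc_type, (_, tier) in ranked.items():
        totals[tier] += doc_counts.get(doc_type, 0)
    return totals


# Made-up scores from three staff members and made-up document counts.
scores = {
    "annual report": [5, 5, 4],
    "meeting materials": [3, 4, 3],
    "draft report": [1, 2, 1],
}
counts = {"annual report": 120, "meeting materials": 75, "draft report": 40}

ranked = rank_document_types(scores)
totals = count_by_tier(ranked, counts)
print(ranked)
print(dict(totals))
```

Keeping the cutoffs as named constants makes it easy to revisit the tiers as priorities change, which matches the way the policy statement itself is updated over time.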
This issue is not unique to Massachusetts; every state library has a mandate to capture state government information, and every state takes a different approach based on its resources, staff expertise and constituents. In my research into how other state libraries and archives handle this mandate, one common thread emerged: at least 24 states use Archive-It as a means of capturing digital content. I was eager to investigate this, as I hoped it could be another resource for Massachusetts to use as well.
The IT department of the Executive Branch of Massachusetts state government, MassIT, has an Archive-It account and has crawled Mass.gov since 2008. Though the account was publicly available, MassIT had not advertised the site, as their focus was on ensuring capture of content rather than accessibility. Seizing this opportunity for collaboration, we reached out to MassIT, who granted us access to the site. We worked together to customize the metadata, and I wrote some language for the library’s website instructing our patrons on how to use Archive-It to access state publications.
Our situation is a bit different in that we do not exclusively maintain the Archive-It account. However, we are using this resource in a similar way to many other state libraries and archives. Archive-It will not be our main repository for accessing state publications; the library has a DSpace digital repository that has been in place since 2006 and will continue to be our central portal for providing enhanced access to high-priority publications.
Archive-It will act as a service for crawling Mass.gov, thereby ensuring the capture of more documents than we could hope to collect on our own and allowing us another means of finding material we may not have captured in the past. Using the two in concert goes a long way towards meeting the legal mandate.
With just a few months left in the residency, there is still much work to be done. We’re testing a workflow for batch downloading PDFs using a Firefox add-on called DownThemAll!, investigating how to streamline the cataloging process and conducting outreach efforts to state agencies.
Outreach is crucial in raising awareness of the library’s resources and services, as well as in reminding agencies about that pesky law regarding their publications. These steps form the foundation of a more sustainable digital preservation program at the State Library.