The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.
Recently, I talked with Kristen Regina, Head of Archives and Special Collections at the Hillwood Estate, Museum and Gardens in northwest Washington, and Jaime McCurry, Digital Assets Librarian, about workflows and issues for web archiving, an activity they are exploring. What could I tell them based on LC’s experiences?
Hillwood Estate, Museum and Gardens opened as a public museum in 1977. It is the former residence of American businesswoman, socialite, philanthropist and collector Marjorie Merriweather Post, and home to one of the most comprehensive collections of Russian imperial art outside of Russia, a distinguished 18th-century French decorative art collection and twenty-five acres of landscaped gardens and natural woodlands.
According to Kristen and Jaime, Hillwood has identified digital stewardship as an area of great importance, both in its strategic planning efforts and in day-to-day operations. This fresh focus has supported the institution’s recent migration to a new digital asset management system; the continuation of digital partnerships, such as Hillwood’s participation in the Google Art Project, to encourage access to its rich digital resources; and, moving forward, the exploration and creation of a well-rounded web archiving program.
As Kristen explained, the team has three specific activities in mind for web archiving:
- Archive Hillwood’s online presence, in particular its own website, http://www.hillwoodmuseum.org. The site would be archived on a regular basis to support traditional archival efforts related to the museum and its ongoing operational activities. This aligns with the usual reasons that an organization of this type would keep copies of brochures, publications, reports and so on that are provided to the public about the organization.
- Targeted harvesting of listings or digital catalogues of materials in scope for the Hillwood collections from the websites of dealers and auction houses such as Sotheby’s or Christie’s. Again, this mirrors the collecting of analogous paper materials.
- Harvesting on a continuing or one-time basis of sites (or more often parts of sites) of peer institutions, particularly in Russia, and web-based publications about Hillwood or topics relevant to its collections or collecting priorities.
One could come up with any number of challenges associated with each of these activities. Thinking about them after our meeting, though, I was struck that each has distinct challenges, and that the best solution for one might not help solve either of the other two. This contrasts with how I usually think about solving web archiving problems.
Archiving the organization’s site: This is a fairly typical activity for many organizations nowadays and can easily be arranged with a vendor who will periodically “crawl” the organization’s web site from top to bottom, capturing as much of the site as is technically possible by following browsable links. This can be done at whatever frequency is desired. The “traditional” approach, however, is to do a complete harvest of the site each time. Depending on how often the site as a whole is revised, this can be a considerable amount of effort to copy something that has changed only slightly since the previous crawl. (The resulting files are de-duped, so at least duplicate copies of the site materials are not stored.) At the same time, a portion of the site might undergo any number of changes that would be missed between scheduled full harvests.
One solution in this situation is a completely different approach: contracting with an organization that can harvest individual pages when they change, triggered by an RSS-like notification. For a web site that is mostly unchanged over time, this would be much more economical, and it would still assure the organization’s management that the question, “what did our site look like on date X?” can be answered accurately in the future. A full crawl could still be attempted once a year as a baseline.
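To make the notification-driven idea concrete, here is a minimal sketch in Python, not a description of any vendor's actual service. It assumes the site publishes an RSS 2.0 feed with one `item` per changed page (a `link` and a `pubDate`), and that the archive keeps a record of when each page was last captured. The feed contents, the `example.org` URLs, and the function name `pages_to_harvest` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical change feed: one <item> per page that was edited.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><link>http://www.example.org/visit</link>
      <pubDate>Tue, 02 Jun 2015 10:00:00 GMT</pubDate></item>
<item><link>http://www.example.org/about</link>
      <pubDate>Mon, 05 Jan 2015 09:00:00 GMT</pubDate></item>
</channel></rss>"""


def pages_to_harvest(feed_xml, last_harvested):
    """Return the URLs whose change notice is newer than our last capture.

    last_harvested maps URL -> timezone-aware datetime of the last capture.
    Pages we have never captured are always returned.
    """
    due = []
    for item in ET.fromstring(feed_xml).iter("item"):
        url = item.findtext("link")
        changed = parsedate_to_datetime(item.findtext("pubDate"))
        previous = last_harvested.get(url)
        if previous is None or changed > previous:
            due.append(url)
    return due


# Both pages were last captured on 1 March 2015; only /visit changed since.
captured = datetime(2015, 3, 1, tzinfo=timezone.utc)
last_harvested = {
    "http://www.example.org/visit": captured,
    "http://www.example.org/about": captured,
}
print(pages_to_harvest(SAMPLE_FEED, last_harvested))
```

Only the pages returned here would be re-crawled, which is where the economy comes from on a mostly static site. The yearly full crawl remains useful as a check on anything the feed fails to announce.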
Targeted crawling of certain types of materials from dealers and auction houses, complementing what other groups such as the New York Art Resources Consortium are collecting: To my mind, this kind of crawling presents a completely different challenge. Hillwood has a relatively narrow and specific collecting profile and most relevant auction houses will have much broader scope. If we assume that Hillwood (or other museums or cultural heritage institutions for that matter) would not want to harvest entire sites and then “throw away” what they don’t need, then the likely solution lies in collaborative effort by collecting institutions. Collaboration is a theme that seems to be gaining traction in discussions of web archiving, which is probably good, but at the same time it presents an entirely different set of challenges.
Harvesting sites of peer institutions, particularly in Russia, and web-based publishing about Hillwood or topics relevant to its collections: These activities seem closer to “traditional” web archiving as I think of it, but they are still challenging for a small organization with a small staff. Hillwood’s focus will rarely align directly with that of other institutions, so “scoping” a crawl of another organization’s site to acquire just the relevant materials, and not the entire site, would often be tricky. In addition, there is the ongoing challenge of identifying what these sites and materials might be, which requires staff attention. This seems to be the greatest challenge here: finding the staff time to identify the sites, scope them properly and later do the quality assurance review of the results.
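As a rough illustration of what “scoping” means in practice, the sketch below (Python, standard library only) keeps a URL only if it is on the target host and under one of a few chosen path prefixes. The host `museum.example`, the path prefixes and the function name `in_scope` are made up for the example; real crawlers express scope rules in their own configuration formats, and deciding on the right prefixes is exactly the staff work described above.

```python
from urllib.parse import urlsplit


def in_scope(url, allowed_host, path_prefixes):
    """True if url is on allowed_host and under one of the path prefixes."""
    parts = urlsplit(url)
    if parts.hostname != allowed_host:
        return False
    return any(parts.path.startswith(prefix) for prefix in path_prefixes)


# Hypothetical scope: only two sections of a peer institution's site.
HOST = "museum.example"
PREFIXES = ["/collections/decorative-arts/", "/exhibitions/"]

candidates = [
    "http://museum.example/collections/decorative-arts/item-12",
    "http://museum.example/shop/posters",
    "http://other.example/exhibitions/2015",
]
for url in candidates:
    print(url, in_scope(url, HOST, PREFIXES))
```

A rule this simple will miss relevant pages that live elsewhere on the site and may let in irrelevant ones, which is why the quality assurance review after the crawl still matters.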
Having worked on web archiving collection-building at the Library of Congress for about five years, I am increasingly struck by the singular nature of the web archiving tools. Perhaps this reflects the relative youth of the activity; perhaps it reflects a certain gratitude that there are tools that do at least one thing. But as I look at some of the ways we at the Library would like to expand our activities, and as I talk to people like Kristen and Jaime, I learn about different use cases that lead me to think about different problems, both technical and organizational, than the ones we have focused on so far.