What Do You Mean by Archive? Genres of Usage for Digital Preservers

One of the tricks to working in an interdisciplinary field like digital preservation is that all too often we use the same terms without actually talking about the same things. In my opinion, the most fraught term in digital preservation discussions is “archive.” It has come to mean so many different things in different contexts that some in digital preservation are reluctant to use the term at all. I wanted to spend a few moments putting text on a URL that anyone can reference from here on out when they need to parse and disambiguate what we mean by archive. For some related reading, I’d suggest checking out Kate Theimer’s Archives in Context and as Context and the role of “the professional discipline” in archives and digital archives.

I’d stress here that I’m not really interested in telling people what is and isn’t an archive. Instead, I’m interested in 1) helping people ensure that they aren’t talking past each other and 2) briefly starting to suss out the resonances between these different usages. In many different contexts the term archive carries significant weight. It often brings with it notions of longevity, safekeeping, order and concern with authenticity; it’s about items or records that hang together for good reason. To varying extents, I think we see these points surface across each of the uses I articulate here. My objective is not to exhaustively describe any of these ways the word is used, but just to briefly gesture toward different usages. I should stress that this is simply how I sort out some of the different usages of the term. I would love to hear more perspectives on usage of the term, and on the resonances between those uses, so I invite readers to suggest additional or different usages in the comments below the post.

Archive as in Records Management

Manuscript Division stacks with acid-free containers. Manuscript Division Slide Collection

In an organizational context, an archives is often the part of the organization that is required to retain and organize the organization’s records. A radio station, a hospital, or a financial services company needs to keep copies of the records of its operation for a range of reasons (litigation, tax purposes, posterity, compliance with regulations, etc.). In this case, the archive serves the purpose of organizing and maintaining records and materials for use by the organization. A big part of the work of such an archive is making sure it keeps only what is deemed useful for particular future use cases.

Archive as in “The Papers of So and So”

One of the specific senses in which archivists use the term archives is to describe a particular kind of collection. Effectively, an archive is a collection of materials that hang together for a very particular reason. An archive is either the papers of some particular person or the papers or records of a particular organization. What makes it an archive is the fact that the items and records in the collection represent a “fonds,” a particular name for a collection that is the result of the ongoing work of the individual or organization. The words “natural” and “organic” generally come into play here, the idea being that the archive is a collection of items and records that exist as a whole. In contrast, an archivist might refer to a collection of rare books pulled together by a collector over time as an “artificial” collection. Artificial in this case is not to say that it’s “bad,” just that the collection was assembled as a set of materials after the fact.

Archive as in “Right Click -> add to Archive”

Example of archive used in web mail.

For most people, the most common usage of the term archive is likely from a context menu in computing. In many operating systems you can simply right-click on the icon for a file and click “add to archive” or “create archive.” In these cases, borrowing from a legacy of usage of the term more generally in computing, this ends up meaning “stick it into some kind of compressed container file.” In this vein, the term archive is largely tied to the idea of “back-up.” Effectively, the archived copy of these files is slightly more difficult to get to but right at your fingertips nonetheless.
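
To make the “compressed container file” idea concrete, here is a minimal sketch using Python’s standard zipfile module. The function names add_to_archive and restore are hypothetical, invented for this illustration; this is not how any particular operating system implements its menu item.

```python
import io
import zipfile


def add_to_archive(files: dict[str, bytes]) -> bytes:
    """Pack name -> content pairs into one compressed ZIP container in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def restore(data: bytes, name: str) -> bytes:
    """Read one file back out of the container, decompressing it on the way."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.read(name)
```

The files come back byte for byte, but only after an extra unpacking step, which is the sense in which the archived copy is slightly harder to get to.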

Usage of the term in web applications, like web email clients, is very similar. In many web mail systems the archive is simply all of your emails that you haven’t deleted and that are not in your inbox. In the logic of “piling vs. filing” this actually makes sense. In the past, you might have organized your correspondence and bills in a particular, structured fashion, keeping only what you needed and deliberately putting it where it would be easy to find later. That filing process for managing records is much more in line with what archivists mean by archive. As email has shifted further and further toward something that people expect to be able to simply run a full-text search against, the term archive has come along with it. But letting everything pile up in one big thing called “archive” and searching against it is very different from the deliberately organized thing that archivists are generally talking about.

Computer data storage in a modern office building, taken during the 1980s, Photographs in the Carol M. Highsmith Archive, Library of Congress, Prints and Photographs Division.

Archive as in “Tape Archive”

When IT people use the term archive they are generally talking about a piece of hardware. At the start of each of the Library of Congress storage architecture meetings we generally need to begin with this vocabulary discussion. As an example, many large organizations use an HSM, a hierarchical storage management system, that maintains different tiers of storage with distinct performance requirements. At this point, the top level might be a relatively small amount of expensive but fast flash memory. Below that might be a larger pool of spinning disk storage, and below that you would likely find something called the “archive” layer. In this case, archive means tape archive. Magnetic tape remains the cheapest medium (you can store a lot more data on tape for a lower cost than on disk) but it is significantly less responsive, so it is going to take you time to get the information back from tape. Within the design of a storage system, then, the stuff you need to keep around but don’t need to access that often, your backup copies and so on, ends up on the biggest but cheapest tier of your storage system.
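
As a rough illustration of how a tiered system might decide where data lives, here is a minimal sketch. The Tier class, the choose_tier function, and the day thresholds are all hypothetical values made up for this example; real HSM products apply their own, more involved policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_idle_days: float  # data idle for longer than this belongs on a lower tier


# Ordered from fastest and most expensive to slowest and cheapest.
# The thresholds are illustrative assumptions, not values from any real system.
TIERS = (
    Tier("flash", 7),
    Tier("disk", 90),
    Tier("tape archive", float("inf")),
)


def choose_tier(idle_days: float) -> str:
    """Return the name of the first tier whose idle threshold covers the data."""
    for tier in TIERS:
        if idle_days <= tier.max_idle_days:
            return tier.name
    return TIERS[-1].name
```

Under these assumed thresholds, something untouched for a year lands on the tape archive tier, while something opened yesterday stays on flash.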

The definition here relies on a long history of using the term archive as a synonym for magnetic tape storage systems. The file format .tar, a way to package data for storage, itself stands for “tape archive.” This use of the term archive goes back to 1940s computer systems architecture. In the original context it referenced online vs. offline storage. The reels of tape were quite literally “off line”: a reel had to be located and mounted before its data became accessible, in contrast to things like magnetic core memory at the time, and later random access memory.
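
The tape heritage is still visible in the format itself. The sketch below uses Python’s standard tarfile module to package a couple of files and list them back; build_tar and list_tar are hypothetical helper names for this illustration, and the file names are made up.

```python
import io
import tarfile


def build_tar(files: dict[str, bytes]) -> bytes:
    """Pack name -> content pairs into an uncompressed tar archive in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)  # the header must declare the payload size
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def list_tar(data: bytes) -> list[str]:
    """Return the member names in the order they were written."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as archive:
        return archive.getnames()
```

A tar file is a sequence of fixed 512-byte blocks written one after another, a layout suited to a medium that is read from start to finish. Note that plain tar only packages files; compression is a separate, optional step.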

Archive as in “Web Archive”

Wendy’s Blog: Legal Tags, Legal Blawgs Web Archive, Law Library of Congress

Many organizations are now in the business of harvesting content from the web for long-term access and preservation. In these cases, tools like Heritrix, an open source web crawler, are sent out to grab all of the rendered content of a webpage they can get ahold of and, within defined parameters, the other pages it links to and all their associated files. As part of this collection process, the tools log information about the date and time that the data was collected. At this point, tools store that content in WARC files, or Web Archive files, which can then be played back via tools like the Wayback Machine. So there is a lot of information in here that can be used to assert the authenticity of the data: how a particular URL presented itself to Heritrix and how Heritrix interpreted it at a particular moment in time. With that said, it’s much more in keeping with the computing usage of archive as a back-up copy of information than with the disciplinary perspective of archives.
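
To show the kind of capture information such a file carries, here is a simplified sketch of a single WARC-style record built by hand. The function names make_warc_record and read_warc_headers are hypothetical, and the record uses the simple “resource” type as an assumption for brevity; a real crawler such as Heritrix writes richer records, including the full HTTP exchange, so treat this as an illustration of the idea rather than of Heritrix’s output.

```python
import uuid
from datetime import datetime, timezone


def make_warc_record(target_uri: str, payload: bytes, content_type: str,
                     captured_at: datetime) -> bytes:
    """Build one simplified WARC 'resource' record: named headers, blank line, payload."""
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {stamp}",              # when the content was captured
        f"WARC-Target-URI: {target_uri}",   # what was captured
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def read_warc_headers(record: bytes) -> dict[str, str]:
    """Parse the named header fields of a record back into a dictionary."""
    head, _, _ = record.partition(b"\r\n\r\n")
    fields = {}
    for line in head.decode("utf-8").split("\r\n")[1:]:  # skip the version line
        name, _, value = line.partition(":")
        fields[name] = value.strip()
    return fields
```

The capture date and target URI stored alongside the content are what playback tools and later researchers rely on when reasoning about what a page looked like at a given moment.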

Archive as in “Digital Archive”

At this point, there are a lot of digital collections using the term archive that don’t necessarily square with how archivists have been using the term. For instance, the September 11th Digital Archive, the Bracero Archive, and the Shelley-Godwin Archive are good exemplars of the diversity of this usage. In each case, an effort was undertaken to collect or bring together related materials. The September 11th Digital Archive is a crowdsourced collection of materials related to the attacks, the Bracero Archive is a digitized collection of oral history interviews with individuals involved in the Bracero guest worker program, and the Shelley-Godwin Archive brings together digitized copies of primary manuscript sources related to a particular family. The origin of this usage is anchored in Jerome McGann’s work on the Rossetti Archive, which McGann developed from a theoretical perspective on the potential of hypermedia to allow for the creation of new kinds of archives. Alongside this usage, digital archive has also been used as a term to refer to born digital materials processed as part of a more traditional notion of an archive. In this case, see usage of “the born digital archives of Salman Rushdie.”

Some archives purists might call all of these “artificial” collections. I, however, wouldn’t. I don’t think this is so much about computing terminology invading the space as about another tradition in which systematically collected materials have been called archives within cultural heritage organizations. Folklife archives, for example the American Folklife Center Archive at the Library of Congress, have long worked to acquire ethnographic field collections for the archive. In these cases, folklorists have gone out and made field recordings and then worked with archivists to organize them for access. With this said, it’s valuable to recognize that the term digital archive generally carries this language and meaning, as opposed to the canonical repository for the “papers of so and so” or the records management terminology. That is, digital archives hang together as “a conscious weaving together of different representational media.” For another take on the idea of digital archives see Kate Theimer’s recent presentation at the American Historical Association’s annual meeting, A Distinction worth Exploring: “Archives” and “Digital Historical Representations.”

Notions and Considerations of “The Archive”

The last category I am including here is about theorizing “the Archive.” A broad range of work in literary and media theory focuses attention on “the Archive.” Here I am thinking of Foucault’s notion of “the Archive” in The Archaeology of Knowledge, Derrida’s perspective in Archive Fever, and Kittler and Wolfgang Ernst’s notions of archives in Media Archeology. For the most part, this body of work is less about what goes on in an individual archives and more about the role of “the archive” in society writ large, or the idea of “the archive” as traces of the past in objects. For example, for Foucault, “the Archive” is not so much an individual set of materials as a term for the entirety of historical records and evidence that exists to work from. These theoretical takes on “the archive” can be frustrating to many archivists, as much of this work does not engage with the professional practices of archives or with “archival theory,” the body of scholarship which archivists themselves have been building through ongoing practice and research since at least the French Revolution.

At the institutional level, discussions of “the archive” are broadly useful for reflecting on the social roles that archives play in culture. Further, a considerable amount of this work in the Media Archeology and Media Theory traditions focuses on processes of inscription and the embedded logic of different media (optical media, gramophones, databases, the MP3 format, etc.), which are increasingly important genres of artifacts and records that archives are themselves tasked with accessioning. Kirschenbaum’s Mechanisms: New Media and the Forensic Imagination is itself an invaluable exemplar of how work from these media theory traditions can combine with archival theory to produce scholarship that directly informs the development of tools and practices for practicing archivists. Again, these broad and interdisciplinary conversations about archives can be quite useful to those working both in and outside archives.

So, are there other definitions I’m missing? Have I got any of the lineage wrong on this? I’d love to continue this discussion in the comments.

Thanks to Matthew Kirschenbaum, Nicki Saylor, and Kate Theimer for comments and suggestions for improvements to this post.