This is a guest post from Dinah Handel.
For the past six months, I’ve been immersed in the day-to-day operations of the archive and library at CUNY Television, a public television station located at the Graduate Center of the City University of New York.
My National Digital Stewardship residency has consisted of a variety of tasks, including writing and debugging the code that performs many of our archival operations, interviewing staff inside and outside of the library and archives, migrating data, and performing fixity checks.
This work all falls under three separate but related goals of the residency, stated in slightly different language in the initial project proposal: document, make recommendations for, and implement changes to the media microservices based on local needs; work with project mentors to implement an open source digital asset management system; and verify the data integrity of digitized and born-digital materials and create a workflow for the migration of 1 petabyte of data from LTO (Linear Tape-Open) 5 to LTO 7 tape.
Workflow Automation through Microservices
Media and metadata arrive at the library and archives from production teams and editors, and in the context of the Reference Model for an Open Archival Information System, this is considered the Submission Information Package (SIP). Our media microservice scripts take SIPs, transcode and deliver access derivatives (Dissemination Information Packages, or DIPs), create metadata, and package all of these materials into our Archival Information Packages (AIPs), which we write to long-term storage on LTO tapes.
Media microservices are a set of bash scripts that use open source software, such as ffmpeg and mediainfo, to automate our archival and preservation workflow. A microservices framework means that each script accomplishes one task at a time, which makes our archival workflow modular. We can change out and upgrade scripts as needed without overhauling our entire system, and we aren’t reliant upon proprietary software to accomplish archiving and preservation. I wrote more about microservices on our NDSR NY blog.
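To make the one-task-per-script idea concrete, here is a minimal sketch in Python rather than bash. It is an illustration only, not the CUNY Television code: the step names, the dictionary used to represent a package, and the placeholder values are all assumptions, and the real scripts call tools such as ffmpeg and mediainfo instead of building strings.

```python
def add_technical_metadata(package):
    # Placeholder for a step that would run mediainfo on the source file.
    return {**package, "metadata": "mediainfo report for " + package["source"]}


def add_access_copy(package):
    # Placeholder for a step that would run ffmpeg to make an access derivative.
    stem = package["source"].rsplit(".", 1)[0]
    return {**package, "access_copy": stem + "_access.mp4"}


def run_pipeline(package, steps):
    # Each step does one job and hands its result to the next step.
    for step in steps:
        package = step(package)
    return package


result = run_pipeline(
    {"source": "show_0001.mov"},
    [add_technical_metadata, add_access_copy],
)
print(result["access_copy"])  # show_0001_access.mp4
```

Because each step only takes a package and returns a package, any one of them can be replaced or upgraded without touching the others.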
One of my first tasks when I began my residency at CUNY Television was to make enhancements to our media microservices scripts, based on the needs of the library and archives staff. I had never worked with bash or ffmpeg, and while it has been quite a learning curve (with a dash of impostor syndrome), I’ve made a significant number of changes and enhancements to the media microservices scripts, and even written my own bash scripts (and also totally broken everything 1 million times).
The enhancements range from small stylistic changes to the creation of new microservices with the end goal of increasing automation in the processing and preservation of our AIPs. It’s been heartening to see the work that I do integrated into the archive’s daily operations and I feel lucky that I’m trusted to modify code and implement changes to the workflow of the archive and library.
In addition to making changes to the media microservices code, I am also tasked with creating documentation that outlines their functionality. I’ve been working on making this documentation general, despite the fact that the media microservices are tailored to our institution, because microservices can be used individually by anyone. My intention is to create materials that are clear and accessible explanations of the code and processes we use in our workflows.
Finally, I’ve also been working on creating internal documentation about how the library and archives function. In some ways, writing documentation has been as challenging as writing code because explaining complex computer-based systems and how they interact with each other, in a narrative format, is tricky. I often wonder whether the language I am using will make sense to an outside or first-time reader or if the way I explain concepts is as clear to others as it is to me.
Digital Asset Management in a Production and Preservation Environment
CUNY Television’s library and archives are situated in both preservation and production environments, which means that in addition to migrating media from tape to digital files and archiving and preserving completed television programs, we are also concerned with the footage (we use the terms raw, remote, and B-roll to denote footage that is not made publicly available or digitized/migrated) that producers and editors use to create content.
Presently, much of that footage resides on a shared server, and when producers are finished using a segment, they notify the library and archives and we move it from the server to long-term storage on LTO. However, this process is not always streamlined and we would prefer to have all of this material stored in a way that makes it discoverable to producers.
Before I arrived at CUNY Television, the library and archive had chosen an open source digital asset management system and one of my tasks included assisting in its implementation. Our intention is that the DAM will house access copies of all of CUNY Television’s material: broadcast footage and born-digital completed television shows, migrated or digitized content and non-broadcast B-roll footage.
To get broadcast shows uploaded to the DAM, I wrote a short bash script that queries the DAM and the server that we broadcast from, to determine which new shows have not yet been uploaded to the DAM. Then, I wrote another script that transcodes the access copies according to the correct specification for upload. The two scripts are combined so that if a video file is not yet on the DAM, it gets transcoded and delivered to a directory that is synced with the DAM and uploaded automatically.
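The logic of that combined check-then-transcode step can be sketched as follows. This is a hedged Python illustration, not the actual bash scripts: the function names, the paths, and the choice of H.264 video with AAC audio in an MP4 file are assumptions, since the post does not state the DAM's upload specification or how the DAM is queried. The sketch only builds the ffmpeg command; it does not run it.

```python
from pathlib import PurePosixPath


def find_missing(broadcast_files, dam_files):
    # Compare on the file name without its extension, since the access copy
    # in the DAM may use a different container than the broadcast file.
    uploaded = {PurePosixPath(name).stem for name in dam_files}
    return sorted(f for f in broadcast_files if PurePosixPath(f).stem not in uploaded)


def build_transcode_command(source, sync_dir):
    # Assumed target format: H.264 video and AAC audio in an MP4 container.
    target = PurePosixPath(sync_dir) / (PurePosixPath(source).stem + ".mp4")
    # -n tells ffmpeg never to overwrite an existing output file.
    return ["ffmpeg", "-n", "-i", str(source),
            "-c:v", "libx264", "-c:a", "aac", str(target)]


on_broadcast_server = ["/broadcast/show_a.mov", "/broadcast/show_b.mov"]
already_in_dam = ["show_b.mp4"]

for source in find_missing(on_broadcast_server, already_in_dam):
    print(" ".join(build_transcode_command(source, "/dam_sync")))
```

In a real run, each command list could be passed to `subprocess.run`, and the two input lists would come from listing the broadcast server and querying the DAM.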
The process of getting production materials into the DAM is much more difficult. Producers don’t necessarily follow file-naming conventions and they often store their materials in ways that make sense to them but don’t necessarily follow a structure that translates well to a DAM system.
After interviewing our production manager, and visiting Nicole Martin, the multimedia archivist and systems manager at Human Rights Watch (which uses the same DAM), we came up with a pilot plan to implement the DAM for production footage.
As I mentioned, it is possible to sync a directory with the DAM for automatic uploads. Our intention is to have producers deposit materials into a synced hierarchical directory structure, which will then get uploaded to the DAM. Using a field-mapping functionality, we’ll be able to organize materials based on the directory names. Depending on how the pilot goes with one producer’s materials, we could expand this method to include all production materials.
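As a rough illustration of what mapping directory names to fields could look like, the Python sketch below turns a file's position in the hierarchy into metadata. The three-level layout (producer, program, shoot date) and the field names are hypothetical; the pilot's actual directory structure and the DAM's own field-mapping feature are not described here.

```python
from pathlib import PurePosixPath

# Hypothetical meaning of each directory level below the synced root.
FIELD_ORDER = ("producer", "program", "shoot_date")


def fields_from_path(path, root):
    relative = PurePosixPath(path).relative_to(root)
    folders = relative.parts[:-1]  # every component except the file name
    # zip stops at the shorter sequence, so shallow paths get fewer fields.
    fields = dict(zip(FIELD_ORDER, folders))
    fields["filename"] = relative.name
    return fields


print(fields_from_path("/sync/jdoe/city_talk/2016-03-01/clip01.mov", "/sync"))
```

A scheme like this only works if producers deposit files at the agreed depth, which is part of what a pilot with one producer's materials would test.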
Data Integrity and Migration
Currently, much of my energy is focused on our data migration. At CUNY Television, we use LTO tape as our long-term storage solution. We have approximately 1 petabyte of data stored on LTO 5 tape that we plan to migrate to LTO 7. LTO 7 tapes are able to hold approximately 6 terabytes of uncompressed files, compared to the 1.5 terabytes that an LTO 5 tape can hold, so they will cut down on the number of tapes we use.
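A quick back-of-the-envelope calculation shows the scale of that reduction. It assumes decimal units (1 petabyte = 1,000 terabytes), tapes filled to their full uncompressed capacity, and a single copy of the data, so real tape counts will differ.

```python
import math


def tapes_needed(total_tb, capacity_tb):
    # Round up, because a partly filled tape is still a whole tape.
    return math.ceil(total_tb / capacity_tb)


TOTAL_TB = 1000  # approximately 1 petabyte, in decimal terabytes

lto5_tapes = tapes_needed(TOTAL_TB, 1.5)
lto7_tapes = tapes_needed(TOTAL_TB, 6.0)

print(lto5_tapes)  # 667
print(lto7_tapes)  # 167
```

Under those assumptions, one copy of the collection drops from roughly 667 tapes to roughly 167, a fourfold reduction that matches the ratio of the two capacities.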
Migrating will also allow us to send the LTO 5 tapes off-site, which gives us geographic separation of all of our preserved materials. The data migration is complex and has many moving parts, and I am researching and testing a workflow that will account for the variety of data that we’re migrating.
We’re pulling data from tapes that were written four years ago, using a different workflow than we currently use, so there are plenty of edge cases where the process of migration can get complicated. The data migration begins when we read data back from the LTO 5 tapes using rsync. There is an A Tape and a B Tape (LOCKSS), so we read both versions back to two separate “staging” hard drives.
Once we read data back from the tapes, we need to verify that it hasn’t degraded in any way. This verification is done in different ways depending on the contents of the tape. Some files were written to tape without checksums, so after reading the files back to the hard drive, we’ve been creating checksums for both the A Tape and B Tape, and comparing the checksums against one another. We’re also using ffmpeg to perform error testing on the files whose checksums do not verify.
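The comparison between the two staging drives can be sketched like this. It is a Python illustration under stated assumptions rather than the scripts actually used: MD5 is assumed as the checksum algorithm (the post does not name one), the function names are made up for the example, and the ffmpeg invocation is a commonly used decode-only error check that is built here but not run.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    # Read in chunks so large video files never have to fit in memory.
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    # Map each file's path relative to the staging root to its checksum.
    root = Path(root)
    return {p.relative_to(root).as_posix(): file_checksum(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def compare_copies(a_root, b_root):
    # A file is flagged if its checksums differ or it exists on one side only.
    a, b = checksum_tree(a_root), checksum_tree(b_root)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


def ffmpeg_error_check_command(path):
    # Decode the whole file, discard the output, and report only errors.
    return ["ffmpeg", "-v", "error", "-i", str(path), "-f", "null", "-"]
```

Calling `compare_copies` on the A Tape and B Tape staging directories returns the relative paths that need a closer look, and each of those could then be passed to `ffmpeg_error_check_command` for error testing.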
This process repeats until there is enough to write to LTO 7. For some files we will just verify checksums and write them to LTO 7 and call it a day. For other files though, we will need to do some minimal re-processing and additional verification testing to ensure their data integrity. To do this, I’ve been working on a set of microservice scripts that update our archival information packages to current specifications, update metadata and create a METS file to go with the archival information package. As of this week, we’ve written one set of LTO 7 tapes successfully and, while we still have a long way to go towards a petabyte, it is exciting to be beginning this process.
Even though each of these project goals is separate, they influence and draw on one another. Microservices are integral to all of our workflows, and making materials accessible via a DAM also relies on creating access copies from materials that have been stored on LTO tapes for the past 4 years. Adopting this holistic perspective has been enormously helpful in seeing digital preservation as an interconnected system, where developing a workflow has system-wide implications.