Improving Technical Options for Audiovisual Collections Through the PREFORMA Project

The digital preservation community is a connected and collaborative one. I first heard about the Europe-based PREFORMA project last summer at a Federal Agencies Digitization Guidelines Initiative meeting when we were discussing the Digital File Formats for Videotape Reformatting comparison matrix. My interest was piqued because I heard about their incorporation of FFV1 and Matroska, both included in our matrix but not yet well adopted within the federal community. I was drawn first to PREFORMA’s format standardization efforts – Disclosure and Adoption are two of the sustainability factors we use to evaluate digital formats on the Sustainability of Digital Formats website – but the wider goals of the project are equally interesting.

In this interview, I was excited to learn more about the PREFORMA project from MediaConch’s Project Manager Dave Rice and Archivist Ashley Blewer.

Kate: Tell me about the goals of the PREFORMA project and how you both got involved. What are your specific roles?

MediaConch Project Manager Dave Rice. Photo courtesy of Dave Rice.

Dave: The goals of the PREFORMA project are best summarized by their foundational document called the PREFORMA Challenge Brief (PDF). The Brief describes an objective to “establish a set of tools and procedures for gaining full control over the technical properties of digital content intended for long-term preservation by memory institutions”. The brief recognizes that although memory institutions have honed decades of expertise for the preservation of specific materials, we need additional tools and knowledge to achieve the same level of preservation control with digital audiovisual files.

For initial work, the PREFORMA consortium selected several file formats including TIFF, PDF/A, lossless FFV1 video, the Matroska container, and PCM audio. After a comprehensive proposal process, three suppliers were selected to move forward with development. A project called veraPDF focusing on PDF/A is led by a consortium made up of Open Preservation Foundation, PDF Association, Digital Preservation Coalition, Dual Lab, and KEEP SOLUTIONS. The TIFF format is addressed by DPF Manager, led by Easy Innova. Ashley and I work as part of the MediaArea.net team. Our project is called MediaConch and focuses on the selected audiovisual formats: Matroska, FFV1, and PCM. MediaArea is led by Jérôme Martinez, who is the originator and principal developer of MediaInfo.

MediaConch Archivist Ashley Blewer. Photo courtesy of Ashley Blewer.

Ashley: Dave and Jérôme have collaborated in the past on open source software projects such as BWF MetaEdit (developed by AudioVisual Preservation Solutions as part of a FADGI initiative to support embedded metadata) and QCTools. QCTools, developed by BAVC with support from the National Endowment for the Humanities, was profiled in a blog post last year. Dave had also brought me in to do some work on the documentation and design of QCTools. When QCTools development was wrapping up, we submitted a proposal to PREFORMA and were accepted into the initial design phase. During that phase, we competed with other teams to deliver the software structure and design. We were then invited to continue to Phase II of the project: the development prototyping stage. We are currently in month seven (out of 22) of this second phase.

The majority of the work happens in Europe, which is where the software development team is based. Jérôme Martinez is the technical lead of the project. Guillaume Roques works on MediaConchOnline, database management, and performance optimization. Florent Tribouilloy develops the graphical user interface, reporting, and metadata extraction.

Here in the U.S., Dave Rice works as project manager and leads the team in optimizations for archival practice, system OAIS compliance, and format standardization. Erik Piil focuses on technical writing, creation of test files, and file analysis. Tessa Fallon leads community outreach and standards organization, mostly involving our plans to improve the standards documentation for both the Matroska and FFV1 formats through the Internet Engineering Task Force. I work on documentation, design and user experience, as well as some web development. Our roles are somewhat fluid, and often we will each contribute to tasks ranging from analyzing bitstream trace outputs to writing press releases for the latest software features.

PREFORMA: PREservation FORMAts for culture information/e-archives

Kate: The standardization of digital formats is a key piece in the PREFORMA puzzle as well as being something we consider when evaluating the Disclosure factor in the Sustainability of Digital Formats website. What’s behind the decision to pursue standardization through the Internet Engineering Task Force instead of an organization like the Society of Motion Picture and Television Engineers? What’s the process like and where are you now in the sequence of events? From the PREFORMA perspective, what’s to be gained through standardization?

Dave: A central aspect of the PREFORMA project is to create a conformance checker that would be able to process files and report on the extent to which they conform to or deviate from their associated specification. Early in the development of our proposal for Matroska and FFV1, we realized that the state of the specifications compromised how effectively and precisely we could create a conformance checker. Additionally, as we interviewed many archives that were using FFV1 and/or Matroska for preservation, we found that the state of the standardization of these formats was the most widely shared concern. This research led us to include in our proposal efforts towards facilitating the further standardization of both FFV1 and Matroska through an open standards body. After reaching agreement with the FFmpeg and Matroska communities, we developed a standardization plan (PDF), which was included in our overall proposal.

As several standards organizations were considered, it was important to gain feedback on the process from several stakeholder communities. These discussions informed our decision to approach the IETF, which appeared the most appropriate for the project needs as well as the needs of our communities. The PREFORMA project is designed with significant emphasis and mandate on an open source approach, including not only the licensing requirements of the results, but also a working environment that promotes disclosure, transparency, participation, and oversight. The IETF subscribes to these same ideals; the standards documents are freely and easily available without restrictive licensing and much of the procedure behind the standardization is open to research and review.

The IETF also strives to promote involvement and participation; their recent conferences include IRC channels, audio and video streams for each meeting, and an assigned IRC channel representative to facilitate communication between the room and virtual attendees. In addition to these attributes, the format communities involved (Matroska, FFmpeg, and libav) were already familiar with the IETF from earlier and ongoing efforts to standardize open audiovisual formats such as Opus and Daala. Through an early discovery process we gathered the requirements and qualities needed in a successful standardization process for Matroska and FFV1 from memory institutions, format authors, format implementation communities, and related technical communities. From here we assessed standards bodies according to traits such as disclosure, transparency, open participation, and freedom in licensing, confirming that the IETF is the most appropriate venue for standardizing Matroska and FFV1 for preservation use.

At this stage of the process, we have presented our proposal for the standardization of Matroska and FFV1 at the July 2015 IETF93 conference. After soliciting additional input and feedback from IETF members and the development communities, we have a proposed working group charter under consideration that encompasses FFV1, Matroska, and FLAC. If accepted, this will provide a venue for the ongoing standardization work on these formats towards the specific goals of the charter.

I should point out that other PREFORMA projects are involved in standardization efforts as well. The Easy Innova team are working on furthering TIFF standardization in their TIFF/A initiative.

Kate: Let’s talk about two formats of interest for this project, FFV1 and Matroska. What are some of the unique features of these formats that make them viable for preservation use and for the goals of PREFORMA?

Initial draft of MediaConch IETF process.

Dave: FFV1 is a very efficient lossless video codec from the FFmpeg project that is designed in a manner responsive to the requirements of digital preservation. A number of archivists participated in and reviewed efforts to design, standardize, and test FFV1 version 3. The new features in FFV1 version 3 included more self-descriptive properties to store its own information regarding field dominance, aspect ratio, and colorspace so that it is not reliant on a container format to store this information. Other codecs that rely heavily on their containers for technical description often face interoperability challenges. FFV1 version 3 also facilitates storage of cyclic redundancy checks in frame headers to allow verification of the encoded data and stores error status messages. FFV1 version 3 is also a very flexible codec, allowing adjustments to the encoding process based on different priorities such as size efficiency, data resilience, or encoding speed. For the past year or two, FFV1 may be seen as being at a tipping point for preservation use. Its speed, accessibility, and digital preservation features make it an increasingly attractive option for lossless video encoding that can be found in more and more large-scale projects; the standardization of FFV1 through an open standards organization certainly plays a significant role in the consideration of FFV1 as a preservation option.
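As a rough illustration of how those encoding priorities translate into practice, the sketch below builds an FFmpeg command line for FFV1 version 3 in a Matroska container. It is not a recipe from the interview: the function name, file names, and default values are hypothetical, and it assumes an FFmpeg build whose ffv1 encoder accepts the `-level`, `-g`, `-slices`, and `-slicecrc` options. The function only assembles the argument list; running it is left to the reader.

```python
def ffv1_v3_args(src, dst, slices=16, gop=1, slice_crc=True):
    """Build an ffmpeg argument list for FFV1 version 3 video in Matroska.

    All defaults are illustrative assumptions, not recommendations
    from the MediaConch team.
    """
    return [
        "ffmpeg",
        "-i", src,                               # hypothetical input file
        "-c:v", "ffv1",                          # use the FFV1 encoder
        "-level", "3",                           # request FFV1 version 3
        "-g", str(gop),                          # GOP size; 1 makes every frame self-contained
        "-slices", str(slices),                  # more slices: finer error localization, more overhead
        "-slicecrc", "1" if slice_crc else "0",  # store a CRC with each slice
        "-c:a", "copy",                          # leave the audio stream untouched
        dst,                                     # an .mkv extension selects the Matroska muxer
    ]


if __name__ == "__main__":
    print(" ".join(ffv1_v3_args("input.mov", "output.mkv")))
```

Raising the slice count or enabling slice CRCs favors data resilience, while fewer slices favor size efficiency, which is the kind of trade-off described above.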

Matroska is an open-licensed audiovisual container format with extensive and flexible features and an active user community. The format is supported by a set of core utilities for manipulating and assessing Matroska files, such as MKVToolNix and mkvalidator. Matroska is based on EBML, Extensible Binary Meta Language. An EBML file is made up of one or more defined “Elements”. Each element consists of an identifier, a value that notes the size of the element’s data payload, and the data payload itself. Matroska integrates a flexible and semantically comprehensive hierarchical metadata structure as well as digital preservation features such as the ability to provide CRC checksums internally per selected elements. Because of its ability to use internal, regional CRC protection, it is possible to update a Matroska file to log OAIS events without any compromise to the fixity of its audiovisual payload. Standardization efforts are currently renewed with an initial focus on Matroska’s underlying EBML format. For those who would like to participate, I’d recommend contributing to the EBML specification GitHub repository or joining the matroska-devel mailing list.
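The identifier, size, payload layout of an EBML element can be sketched in a few lines of Python. This is a minimal illustration rather than a conformance-grade parser: the function names are hypothetical, it assumes the usual EBML variable-length integer encoding (the number of leading zero bits in the first byte gives the field width), and it does not handle unknown-size elements or nested parsing.

```python
def read_vint(data, pos, keep_marker):
    """Read one EBML variable-length integer starting at data[pos].

    Returns (value, next_pos). Element IDs keep the length-marker bit;
    data sizes drop it.
    """
    first = data[pos]
    if first == 0:
        raise ValueError("fields wider than 8 bytes are not handled in this sketch")
    length = 8 - first.bit_length() + 1        # leading zero bits + 1
    if pos + length > len(data):
        raise ValueError("truncated variable-length integer")
    if keep_marker:
        value = first
    else:
        value = first & ((1 << (8 - length)) - 1)  # clear the marker bit
    for byte in data[pos + 1 : pos + length]:
        value = (value << 8) | byte
    return value, pos + length


def read_element(data, pos=0):
    """Split one element into (element_id, size, payload, next_pos)."""
    element_id, pos = read_vint(data, pos, keep_marker=True)
    size, pos = read_vint(data, pos, keep_marker=False)
    if pos + size > len(data):
        raise ValueError("truncated payload")
    return element_id, size, data[pos : pos + size], pos + size


if __name__ == "__main__":
    # A DocType element: ID 0x4282, one-byte size field 0x88 (8 bytes), payload "matroska".
    sample = bytes.fromhex("428288") + b"matroska"
    element_id, size, payload, _ = read_element(sample)
    print(hex(element_id), size, payload)
```

Because every element announces its own length, a reader can skip elements it does not understand, which is part of what makes the format extensible.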

Ashley: Matroska is especially appealing to me as a former cataloger and someone who has migrated data between metadata management systems because of its inherent ability to store a large breadth of descriptive metadata within the file itself. Archivists can integrate content descriptions directly into files. In the event of a metadata management software sunsetting or potential loss occurring during the file’s lifetime of duplication and migration, the file itself can still harbor all the necessary intellectual details required to understand the content.

MediaConch’s plan to integrate into OAIS workflows.

It’s great to have those self-checking mechanisms for setting and verifying fixity built into a file format’s infrastructure instead of requiring an archivist to do supplemental work on top by storing technical requirements, checksums, and descriptive metadata alongside a file for preservation purposes. By using Matroska and FFV1 together, an archivist can get full coverage of every aspect of the file. And if fixity fails, the point where that failure occurs can be easily pinpointed. This level of precision is ideal for preservation and serves as an early warning for archivists in the future. Since error warnings can be frame/slice-level specific, assessing problems becomes much easier. It’s like being able to use a microscope to analyze a record instead of being limited to plain eyesight. It avoids the problem of “I have a file, it’s not validating against a checksum that represents the entirety of a file, and it’s a 2 hour long video. Where do I begin in diagnosing this problem?”
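The difference between one whole-file checksum and regional checks can be illustrated with a short Python sketch. It does not reproduce the on-disk layout of FFV1 slices or Matroska CRC elements; the fixed-size regions, function names, and sample data are assumptions made purely to show why regional CRCs let a failure be localized.

```python
import zlib


def region_crcs(data, region_size):
    """Compute a CRC-32 for each fixed-size region of data."""
    return [
        zlib.crc32(data[start : start + region_size])
        for start in range(0, len(data), region_size)
    ]


def damaged_regions(data, expected, region_size):
    """Return the indexes of regions whose CRC no longer matches."""
    actual = region_crcs(data, region_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


if __name__ == "__main__":
    original = bytes(range(256)) * 4           # stand-in for an audiovisual payload
    stored = region_crcs(original, 256)        # checks recorded when the file was made

    corrupted = bytearray(original)
    corrupted[600] ^= 0xFF                     # flip one byte inside the third region

    print(damaged_regions(bytes(corrupted), stored, 256))  # → [2]
```

A single checksum over the whole payload would only report that something changed; the regional list narrows the search to one region, here index 2.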

Kate: What communities are currently using them? Would it be fair to say that FFV1 and Matroska are still emerging formats in terms of adoption in the US?

Ashley: Indiana University has embarked upon a project to digitally preserve all of its significant audio and video recordings in the next four years. Mike Casey, director of technical operations for the Media Preservation Initiative project, confirmed in a personal email that “after careful examination of the available options for video digitization formats, we have selected FFV1 in combination with Matroska for our video preservation master files.”

Dave: The Wikipedia page for FFV1 has an initial list of institutions using or considering FFV1. Naturally, users do not need to announce publicly that they use it, but there’s been an increase in messages to related community forums.

Plan to integrate into the open source community/outreach strategy

Kate: Do you expect that the IETF standardization process will likely help increase adoption?

Ashley: I think a lot of people are unsure of these formats because they aren’t currently backed by a standards body. Matroska has been around for a long time and is a sturdy open source format. Open source software can have great community support but getting institutional support isn’t usually a priority. We have been investing time into clarifying the Matroska technical specifications in anticipation of a future release.

The harder case to be made regarding adoption in libraries and archives is with FFV1, as this codec is relatively new, less familiar, and has yet to be fully standardized. Access to creating FFV1 encoded files is limited to people with a lot of technical knowledge.

Kate: One of my favorite parts of my job is playing format detective in which I use a set of specialized tools to determine what the file is – the file extension isn’t always a reliable or specific enough marker – and if the file has been produced according to the specifications of a standard file format. But the digital preservation community needs more flexible and more accurate format identification and conformance toolsets. How will PREFORMA contribute to the toolset canon?

Ashley: The initial development with MediaConch began with creating an extension of MediaInfo, which is already heavily integrated into many institutions in the public and private sectors as a microservice to gather information about media files. The MediaConch software will go beyond just providing useful information about the file and help ensure that the file is what it says it is and can continually be checked through routine services to ensure the file’s integrity far into the future.

MediaConch GUI with policy editor displaying parameters.

A major goal for PREFORMA is the extensibility of the software being developed: working across all computer platforms, checking files at the item level or in batches, and cross-comparability between the different formats. We collaborate with Easy Innova and veraPDF to discover and implement compatible methods of file checking. The intent is to avoid creating a tool that exists within a silo. Even though we are three teams working on different formats, we can, in the end, be compatible through API endpoints, not just among the three funded teams but also with other specialized tools or archival management programs like Archivematica. Keeping the software open source for future accessibility and development is not optional; it’s required by the PREFORMA tender.

Dave: Determining if a file has been produced according to the specifications of a standard file format is a central issue to PREFORMA, and unfortunately there are not nearly enough tools to do so. I credit Matroska for developing a utility, mkvalidator, alongside the development of their format specifications, but to have this type of conformance utility accompany the specification is unfortunately a rarity.

Our current role in the PREFORMA project is fairly specific to certain formats but there are some components of the project which contribute to file format investigation. Already we have released a new technical metadata report, MediaTrace, which may be generated via MediaInfo or MediaConch. The MediaTrace report will help with advanced ‘format detective’ investigations as it presents the entire structure of an audiovisual file in an orderly way. The report may be used directly, but within our PREFORMA project it plays a crucial role in supporting conformance checks of Matroska. MediaConch is additionally able to display the structure of Matroska files and will eventually allow metadata fixes and repairs to both Matroska and FFV1.

MediaArea seeks input and feedback on the standard, specifications and future of each format for future development of the preservation-standard conformance checker software. If you work with these formats and are interested in contributing your requirements and/or test files, please contact us at info@mediaarea.net.