This blog post is co-authored by Carl Fleischhauer, Project Manager, Digital Initiatives, Library of Congress.
People who manage audio and video files over time, do create fixity data, aka hash values or checksums, to help monitor the condition of those files in storage and when moved from one system or media to another system or media. And they often do what others also do: create fixity data for a whole file and allow their data management system to retain and compare that historic data with a fresh calculation, to see if the file has changed and, if so, to take remedial action. The NDSA Infrastructure and Standards and Practices working groups’ Checking Your Digital Content: How, What and When to Check Fixity paper summarizes these concepts. But people who produce audio and video files, and those who manage them, often also create fixity data on parts of files and sometimes that data is embedded in the file. What gives?
An increasing practice within the audiovisual community is to generate fixity values at the intra-file or part level in addition to the whole file. Archivist and technologist Dave Rice calls these “little checksums for large files” and explains that “by producing checksums on a more granular level, such as per frame, it is more feasible to assess the extent or location of digital change in the event of a checksum mismatch.” These little checksums allow for monitoring changes in specific parts of the file. Imagine establishing separate fixity value on the whole file and the data only part of a file. Changes made to the embedded metadata would change the whole file fixity value but not that of the video only data, establishing proof that the video data is unchanged.
Practice varies about the level of granularity but three common parts are the container or “chunk”, frame and smaller-than-frame levels. A well-known example of container or “chunk” level hashing is the audio chunk in RIFF-based Broadcast Wave files. The Federal Agencies Digitization Guideline Initiative report on Embedding Metadata in Digital Audio Files: Introductory Discussion (PDF) explains that creating an “audio-data-only [MD5] checksum (including the entire <data> chunk, excluding the chunk id, size declaration, and any optional padding byte) the audio portion of the file which helps validate the integrity of the audio but allows for alteration of the metadata.” Even more granular is frame-level fixity data, such as found in FLAC. More granular still are implementations at the V (or data value) of the KLV triplet structure found in digital cinema packages as defined in SMPTE ST 429-6:2006.
Options for intra-file fixity values in MXF files are an especially interesting bunch. The BBC’s implementation of MXF (PDF), for example, uses both per-track and per-frame checksums (in addition to whole file checksums). MXF also permits the option for “edit unit” level checksums, a term used in SMPTE MXF (defined by ST 377-1) to define one content package. The FADGI AS-07 MXF application specification permits one checksum value for every V data unit in the KLV triplets that represent progressive scanned picture data frame-wrapped essences. Picture data for interlaced video, including JPEG 2000 compressed video case I2 (frame wrapping, interlaced two fields per KLV triplet, defined by SMPTE ST 422:2013), will very often be carried with the data from both fields represented as a single V in a KLV triplet. The checksum value is calculated on the concatenated values of the two Vs in the pair of KLVs.
Intra-file fixity values in audio and video files serve roles beyond those of whole file checksums including process monitoring. The first process monitoring use case is creating the fixity value as early in the process as possible – ideally at the time of initial creation. The FADGI report on audio Interstitial Errors documents the problems of writing to disk after initial creation. In short, errors in the digital data can creep in to the chain almost immediately, starting with the movement from the analog to digital converter to the digital audio workstation and in the DAW’s writing of the file to a storage medium. A fixity value at time of creation compared with a fixity value created by the DAW or at the storage destination could be used to validate a successful capture of the data and of the file’s movement through the workflow.
Another process monitoring use case is confirming whether a losslessly transcoded file represents the original encoded data authentically. Dave Rice in Reconsidering the Checksum for Audiovisual Preservation explains that “a preservation-suitable lossless audiovisual encoding should decode to the same data that the original source would decode to, meaning that each pixel, frame, and timing decoded from the lossless version should be the same as the decoded original.” In other words, the frame-level fixity values of an uncompressed file should be identical to the frame-level fixity values of a losslessly compressed version of the same file if the compression is truly mathematically lossless.
Clearly, these little checksums are valuable instruments and it’s getting easier to create and store them. Some workflows store intra-file fixity values within the file itself. This is of course impossible to do with whole file fixity values because the recursive nature of the process. Embedding a new value in the file changes the file itself so the fixity value could never “catch up.” Intra-file level checksums however can be embedded. The FADGI-created BWF MetaEdit tool includes an option to create an MD5 checksum on the audio data chunk of Broadcast Wave file and puts it in the MD5 chunk within the file using the id <MD5 >. The National Archives and Records Administration’s AVI MetaEdit tool does the same for the video data chunk of AVI files. FLAC also allows internal storage of an MD5 checksum value of audio only data in the STREAMINFO block.
The BBC Ingex system embeds (PDF) frame-level checksum values in the MXF Generic Container System Item (PDF). The FADGI AS-07 MXF application specification will recommend storing the fixity value in the Generic Container System Item but will also permit storing the value within the KLV triplet, following the model of DCPs.
Toolsets like FFmpeg include capability for two types of checksums at the frame level. Framecrccreates Adler-32 CRC checksums while framemd5 creates MD5 hash values for each audio and video packet.
Of course, the audiovisual community isn’t the only one using intra-file checksums or even checksums for process monitoring. One such example is the frame check sequence in networked storage just for starters. The concepts behind checksums on parts of files are extensible to other file types too. There’s nothing stopping say, the imaging community from embedding checksums on specifically EXIF metadata in TIFF files to help establish if that data is altered or stripped during processing.