Today’s guest post is from Liz Holdzkom, Marcus Nappier and Kate Murray of the Digital Collections Management & Services Division and Ted Westervelt, Chief, US/Anglo Division at the Library of Congress.
As the Library of Congress expands its digital collecting activities, the Recommended Formats Statement (RFS) supports a structured methodology to assess the viability of digital formats. These efforts are part of the Library’s strategy to collect and engage fully with the breadth of digital creative works.
Background on the Recommended Formats Statement
The Recommended Formats Statement identifies hierarchies of the physical and technical characteristics of creative formats, both analog and digital, which will best meet the needs of creators, publishers, and cultural heritage institutions, maximizing the chances that creative content will survive and continue to be accessible well into the future.
The RFS continues to serve two primary functions related to how the Library plans for the preservation of, and access to, materials: 1) provide internal guidance to inform acquisitions-related decisions, and 2) spread best practices for ensuring the preservation of, and long-term access to, the creative output of the nation and the world.
Changes for 2022-2023
The RFS is updated annually, and this year brings some significant changes. The first is the addition of the new Email content category, which defines acceptable formats for both individual email messages and aggregated groups of messages. Note that no preferred formats are listed for email at this time, as the Library of Congress continues to assess and improve its capability to process, serve and preserve email collections. This new content category also defines the preferred metadata that should accompany email messages. See the email and PIM (Personal Information Manager) formats information on the Sustainability of Digital Formats site for more details about the formats covered, including EML and MSG for individual email messages and PST and MBOX for compiled groups of messages (e.g., an entire inbox or folder, as supported by an email client).
Another significant update to the RFS is the addition of selected information in the Datasets content area. Here, there are supplementary recommendations for acceptable formats for the aggregation or transfer of datasets. The RFS now designates ZIP, RAR, tar and 7z files as acceptable aggregates for datasets, but notes that these aggregates should be free of encryption, password protection, and other protection measures that would limit the successful transfer of data. For more about aggregate formats, see Quality and Functionality Factors for Aggregate Formats.
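As an illustration of the "free of encryption" recommendation above, here is a minimal sketch of how a depositor might check a ZIP aggregate before transfer. It is not an official Library of Congress tool. The function names `is_encrypted` and `encrypted_members` are hypothetical, and the check covers only the ZIP format, relying on bit 0 of each entry's general-purpose flags, which the ZIP specification uses to mark an encrypted entry.

```python
import io
import zipfile

# Bit 0 of a ZIP entry's general-purpose flags marks the entry as encrypted.
ENCRYPTED_FLAG = 0x1


def is_encrypted(info):
    """Return True if this ZIP entry is flagged as encrypted (hypothetical helper)."""
    return bool(info.flag_bits & ENCRYPTED_FLAG)


def encrypted_members(archive):
    """List the names of encrypted entries in a ZIP file (path or file-like object)."""
    with zipfile.ZipFile(archive) as zf:
        return [info.filename for info in zf.infolist() if is_encrypted(info)]


if __name__ == "__main__":
    # Build a small unprotected ZIP in memory purely for demonstration.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("data/observations.csv", "id,value\n1,42\n")
    print(encrypted_members(buf))
```

An empty list suggests the aggregate has no encrypted entries. RAR, tar and 7z aggregates would need separate tooling, which this sketch does not attempt.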
In a related change, we have also updated our preferences for datasets metadata. It is preferred that manifests or file lists are included for any aggregations of datasets. Preferred metadata for datasets now also include file size and versioning information, such as date or version number, for all aggregate files.
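The manifest preference above can be sketched in a few lines. The RFS does not prescribe a manifest layout, so the JSON structure, the field names (`version`, `file_count`, `files`, `path`, `size_bytes`) and the function `build_manifest` are all assumptions made for illustration only.

```python
import json
import tempfile
from pathlib import Path


def build_manifest(root, version):
    """Build a simple file list with sizes for everything under `root`.

    `version` is a free-form date or version string supplied by the depositor.
    The layout is a hypothetical example, not a format defined by the RFS.
    """
    root = Path(root)
    files = [
        {"path": p.relative_to(root).as_posix(), "size_bytes": p.stat().st_size}
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]
    return {"version": version, "file_count": len(files), "files": files}


if __name__ == "__main__":
    # Demonstrate on a throwaway directory so the example is self-contained.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "readings.csv").write_bytes(b"id,value\n1,42\n")
        print(json.dumps(build_manifest(tmp, "2023-01-15"), indent=2))
```

A manifest like this could be written alongside the data before the files are packaged, so the recipient can confirm that every file arrived and matches its recorded size.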
Additionally, Web Archive Collection Zipped (WACZ) from the Webrecorder project has been added as an acceptable format for Web Archives, along with CDX as a component file for WARC file content. GeoJSON has also moved from Acceptable to Preferred for Geographic Information System (GIS) – Vector Data.
One more important improvement is the creation of a summary table listing the digital file format information for all categories together in one place.