Today’s guest post is from Andrew Cassidy-Amstutz, Kate Murray, Marcus Nappier, Camille Salas, and Trevor Owens of the Library of Congress.
The Library of Congress recently completed a project to analyze the technical characteristics of a substantial set of eBook and eJournal files in the permanent collection and available for onsite access in Stacks, the Library’s access system for rights-restricted content. These files were selected because they contain embedded data such as audio, video, and other interactive features that are not fully transparent. The research and resulting analysis from this project will inform current action plans for access and preservation. The culmination of this work was a virtual project summit hosted by the Library of Congress on April 13, 2023, entitled “The Next Chapter: Results and Recommendations from an Analysis of eBook and eJournal Content at the Library of Congress.” The summit brought together experts from other libraries, academic institutions, and the publishing industry, who engaged in a lively discussion about the findings and recommendations for the broader community.
Through a contract with Digital Bedrock, the project started last June with an analysis of 150,000 files in formats such as EPUB, PDF, HTML, and XML. The project produced a technical file analysis, comparison matrices for characterization and rendering tools, a tools gap analysis, and the Library-hosted summit.
Linda Tadic and Henry Rosen from Digital Bedrock kicked off the summit by sharing the findings from the analysis of the delivered files, the majority of which render successfully in Stacks. As anticipated in advance of this research project, many of the files were embedded with additional data such as audio and video. The analysis also demonstrated that the embedded files support and enhance access to the Library’s digital collections. Overall, the findings suggested a higher degree of confidence that existing digital preservation practices would provide enduring access to these materials in Stacks.
Digital Bedrock also discussed the tools used during the analysis and the purpose of each, including EPUBCheck, ExifTool, veraPDF, MediaInfo, and many others. One significant theme they stressed is that no single tool is adequate to obtain technical characteristics across all of the provided file formats. Only by using multiple complementary tools could such a robust analysis be performed. These are incredibly useful tools to have in the toolbox. The Library also possesses tools and techniques to repair files as needed, but repair was not necessary for the majority of the delivered files. Digital Bedrock also offered a detailed review of their recommendations for the Library and the broader digital preservation community, focused on three major activities: acquisition of the content, accessibility, and preservation.
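The point about complementary tools can be pictured as a simple routing table. The sketch below is only an illustration: the extension-to-tool mapping and the `plan_tools` helper are assumptions made for this example, not the comparison matrices produced by the project, and the code only plans which tools to run rather than invoking them.

```python
from pathlib import PurePath

# Illustrative mapping only (an assumption for this sketch, not the
# project's actual tool comparison matrix).
TOOLS_BY_EXTENSION = {
    ".epub": ["EPUBCheck", "ExifTool"],
    ".pdf": ["veraPDF", "ExifTool"],
    ".mp3": ["MediaInfo", "ExifTool"],
    ".mp4": ["MediaInfo", "ExifTool"],
}


def plan_tools(filenames):
    """Return (plan, gaps).

    plan maps each file name to the tools that would be run on it;
    gaps lists the files for which this table has no tool at all.
    """
    plan = {}
    gaps = []
    for name in filenames:
        # Compare extensions case-insensitively.
        extension = PurePath(name).suffix.lower()
        tools = TOOLS_BY_EXTENSION.get(extension)
        if tools:
            plan[name] = list(tools)
        else:
            gaps.append(name)
    return plan, gaps


plan, gaps = plan_tools(["book.EPUB", "article.pdf", "clip.mp4", "data.xyz"])
for name, tools in plan.items():
    print(name, "->", ", ".join(tools))
print("no tool assigned:", gaps)
```

Even in this toy table, every format is routed to more than one tool, and any file that falls outside the table surfaces as a gap to investigate.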
Other featured speakers included Tim Allison, Duff Johnson, and Maureen Pennock, who presented lightning talks that addressed, complemented, and expanded on Digital Bedrock’s findings. Duff Johnson stressed the importance of agreement on terminology as part of such analysis and cited the term “validation” and its meaning in a particular context. For example, developers may use the term “validation” to reflect a file’s usefulness, including its renderability and the presence of viruses or personally identifiable information (PII). Digital preservation practitioners may instead use “validation” to mean confirmation that the file meets the technical specifications for a given format. PDF versions and their relevance to digital preservation workflows also became a major discussion point and were proposed for a more in-depth discussion as a Digital Preservation Coalition (DPC) Connect topic this year. Tim Allison spoke about Apache Tika and various tools to extract data from PDFs, which reinforced many of the talking points from Digital Bedrock’s presentation. Our final lightning talk speaker, Maureen Pennock, discussed the British Library’s work with eBook collections and detailed research and sustainability assessments conducted in 2014. She also highlighted some of the British Library’s ongoing work to render EPUB3 files and make eBooks accessible via mobile apps. The lightning talks ultimately highlighted and reinforced the digital preservation issues that were raised in the project’s analysis.
The connection and engagement with community experts to exchange findings and experiences served as a key driver of this summit, and the discussion and questions raised by all of the presentations were a true bright spot. The summit ultimately embodied the spirit of the Library’s goal of “We Will Connect: Drive momentum in our communities.” These exchanges spurred great conversations and recommendations that all participants seemed interested in advancing, including the following:
- Review Existing Tool and Resource Community: Numerous summit attendees agreed that maintaining a repository of resources and tools for fixing problematic eBook and eJournal content, particularly content with rendering issues, would be beneficial for the broader community. Reviewing and adding to the COPTR Tools Grid could be a great starting point for this proposed next step.
- Consider and Implement Future Tool Improvements: As a result of the summit, improvements have been made to Apache Tika that strengthen the core tooling used for this kind of analysis. Apache Tika 2.8.0 now detects PDF incremental updates and lets users instruct Tika to parse and extract text and metadata from those incremental updates.
- Plan for Continued International Discussions on PDF Versioning for Digital Preservation Workflows: Several attendees suggested that the topic of PDF versioning could serve as a focal point for a future DPC discussion, prompted by the question, “In which workflows does PDF version number matter?” There was considerable agreement about the need to capture metadata on PDF feature sets and to move beyond PDF version numbers.
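For readers less familiar with the PDF details behind the last two recommendations, here is a minimal sketch of what a version number and an incremental update look like at the byte level. It does not use Apache Tika and is not how Tika implements its detection; the function names are hypothetical, and the `%%EOF` count is only a rough heuristic, since linearized PDFs can also contain more than one `%%EOF` marker and a PDF's document catalog may declare a version that overrides the one in the header.

```python
import re


def pdf_header_version(data: bytes):
    """Return the version from the '%PDF-x.y' header, or None if absent."""
    # The header is expected near the start of the file.
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None


def count_eof_markers(data: bytes) -> int:
    """Count '%%EOF' markers; more than one hints at incremental updates."""
    return data.count(b"%%EOF")


# A tiny synthetic byte string standing in for a PDF that was saved once
# and then updated once. It is not a complete, valid PDF.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"%%EOF\n"
    b"2 0 obj\n<< /Type /Annot >>\nendobj\n"
    b"%%EOF\n"
)

print("header version:", pdf_header_version(sample))
print("%%EOF markers:", count_eof_markers(sample))
```

The sketch also shows why a version number alone says little: both saves share the same header, while the content appended after the first `%%EOF` is where the file actually changed.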
We are incredibly appreciative of all presenters and attendees and were impressed by the level of engagement and lively discussion that took place. It was a pleasure to see and hear from so many experts and professionals from various backgrounds coming together to share their knowledge and ideas. Since this summit spurred attendees to continue conversations about these issues, we hope that this will lead to continued innovative approaches to preserve these materials.