LC Labs’ Computing Cultural Heritage in the Cloud (CCHC) initiative explores pathways for the Library to deliver its digital collections at scale, using a cloud computing environment. You can read more in previous posts about the initiative.
Earlier this year, LC Labs worked with three research fellows in digital history, digital art history, and software librarianship on individual computational research projects. Computational research applies computing processes like algorithms to traditional research topics, such as the study of history. For example, digital history researchers often use computational methods to uncover relationships between historic materials, visualize the contents of those materials, or even make them easier to find. Each of the researchers applied computational methods to a topic of their own choosing. From exploring the relationships among collection items with the help of neural nets, to making images easier to find with the assistance of computer vision, to building methods for identifying biblical quotations across collection materials, the researchers’ approaches reveal new ways of seeing collection materials.
On the Library side, CCHC staff were interested in better understanding the researchers’ methods for accessing, transforming, and representing data derived from the Library’s collections. These collaborations were a unique opportunity for the Library to understand firsthand how to support patrons who wish to approach library collections computationally in pursuit of broader research questions.
The CCHC Researchers’ Work
Andromeda Yelton: Exploring and Mapping Relationships between Collection Items
Andromeda Yelton developed a website that shows similar documents from the Reconstruction era (1855-1877) in the Library’s collections. For example, users might explore the relationships between these materials and keywords such as “education” or “suffrage.”
Based on the full text of 590,000 Library of Congress collection items, Yelton trained a neural net to identify similarities in the form and content of the materials. A neural net is a machine learning model that learns to represent relationships among data elements. She used a tool called Doc2Vec, which learns a vector representation for each document based on the contexts in which its words appear, so that documents with similar language receive similar vectors. Her visualizer shows the similarity between documents through the proximity of the dots representing them.
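Under the hood, Doc2Vec maps each document to a dense numeric vector, and "similarity" between two documents is typically measured as the cosine of the angle between their vectors. A minimal sketch of that comparison step (the vectors below are invented toy values, not trained embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for trained Doc2Vec embeddings.
doc_suffrage_speech = [0.9, 0.1, 0.3]
doc_suffrage_pamphlet = [0.8, 0.2, 0.25]
doc_farm_report = [0.05, 0.9, 0.1]

# Documents about similar topics point in similar directions.
print(cosine_similarity(doc_suffrage_speech, doc_suffrage_pamphlet) >
      cosine_similarity(doc_suffrage_speech, doc_farm_report))  # → True
```

In a visualizer like Yelton's, high-dimensional vectors are then projected down to two dimensions so that nearby dots correspond to similar documents.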
Some of Yelton’s reflections focused on the experience of research and the challenges she encountered in preparing and normalizing data, including:
- The complicated process of filtering out OCR errors
- Trying out various queries to identify Reconstruction era results from the loc.gov API
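Filtering OCR errors can be approached many ways, and Yelton’s actual pipeline is in her repository. As one illustration only, a simple heuristic flags tokens that are mostly non-alphabetic or contain interior case flips, both common OCR artifacts:

```python
import re

def looks_like_ocr_garble(token, min_alpha_ratio=0.75):
    """Heuristic: flag tokens that are mostly non-alphabetic or oddly mixed-case."""
    if not token:
        return True
    alpha = sum(ch.isalpha() for ch in token)
    if alpha / len(token) < min_alpha_ratio:
        return True
    # Interior case flips like "eDucation" are a common OCR artifact.
    if re.search(r"[a-z][A-Z]", token):
        return True
    return False

def filter_ocr_tokens(text):
    """Keep only the tokens that pass the garble heuristic."""
    return [t for t in text.split() if not looks_like_ocr_garble(t)]

print(filter_ocr_tokens("the scb00l of eDucation"))  # → ['the', 'of']
```

Real pipelines usually combine heuristics like these with dictionary lookups, which is why the process is complicated: a rule strict enough to drop garble can also drop legitimate historical spellings.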
She considered questions such as:
- How do you create a dataset containing all Library items relevant to a particular topic, such as Reconstruction?
- How do the online text representations of materials differ from the original document texts?
The code for Yelton’s work on Situating Ourselves in Cultural Heritage is available on GitHub.
Lincoln Mullen: Finding Biblical Quotations across the Library’s Collections
Dr. Lincoln Mullen joined the CCHC initiative to extend work on his project, America’s Public Bible. He developed software that copies the text of the Library’s digital collections into a local database, continuously checking all collections for new text items via the loc.gov API. The software is written in the Go programming language and was designed to run in a cloud computing environment, as well as to be extended in other programming languages.
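Mullen’s harvester is written in Go, but the shape of the loop is easy to sketch: the loc.gov API returns JSON, and paged responses include a pagination object pointing at the next page. The field names below follow the public loc.gov JSON responses (treat them as an assumption here), and the fetch function is injected so the loop can run against a fake feed instead of the live API:

```python
def harvest_collection(first_url, fetch):
    """Walk a paginated loc.gov-style JSON feed, yielding every result item.

    `fetch` is any callable mapping a URL to a parsed JSON dict, so the loop
    can be tested without network access (or pointed at a real HTTP client).
    """
    url = first_url
    while url:
        page = fetch(url)
        yield from page.get("results", [])
        # loc.gov responses carry a "pagination" object; "next" is None on the last page.
        url = (page.get("pagination") or {}).get("next")

# A tiny fake feed standing in for live API responses.
fake_pages = {
    "page1": {"results": [{"id": "a"}, {"id": "b"}],
              "pagination": {"next": "page2"}},
    "page2": {"results": [{"id": "c"}],
              "pagination": {"next": None}},
}

items = list(harvest_collection("page1", fake_pages.__getitem__))
print([item["id"] for item in items])  # → ['a', 'b', 'c']
```

Making the loop restartable and polite about request rates is where most of the real engineering effort goes; that is the part Mullen’s Go software handles for a continuously growing set of collections.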
Mullen wanted to address a problem common to computational researchers: the digital collections they want to analyze continue to grow in scope. He also wanted to address the challenge of performing computations across multiple collections.
He found that the software:
- Was capable of identifying quotations across many Library of Congress collections
- Identified over 232,290 unique combinations of references and items containing biblical quotations
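America’s Public Bible describes quotation finding in terms of matching sequences of words. A toy version scores a page by the fraction of a verse’s word n-grams it contains; this is an illustrative sketch, not Mullen’s production code:

```python
def ngrams(text, n=3):
    """Set of word n-grams, lowercased, with surrounding punctuation stripped."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def quotation_score(verse, document, n=3):
    """Fraction of the verse's n-grams that also appear in the document."""
    verse_grams = ngrams(verse, n)
    if not verse_grams:
        return 0.0
    return len(verse_grams & ngrams(document, n)) / len(verse_grams)

verse = "Blessed are the peacemakers, for they shall be called the children of God."
page = ("The speaker reminded us that blessed are the peacemakers, "
        "for they shall inherit respect.")
print(quotation_score(verse, page) > 0.3)  # → True
```

Scoring every verse against every item is what makes this a large-scale problem: the 232,290 reference-item combinations above come from running a comparison like this across many collections at once.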
His work for this project is available on GitHub.
Lauren Tilton: Using Computer Vision to Analyze Photo Collections
Dr. Lauren Tilton and her colleague, Dr. Taylor Arnold, undertook research entitled Access and Discovery of Documentary Images (ADDI). This work assessed visual recognition technologies and culminated in the development of a prototype that showcases photographic collections, while detecting the objects, faces, and poses that are present in each photo. Tilton and Arnold applied algorithmic methods such as face detection and object detection and explored ways to augment image metadata by producing annotations, such as faces or objects like cars. Being able to automatically identify what is in an image can make it easier to access and discover photographs. The visualizer also makes recommendations for other photos that contain similar objects or appear in a similar photographic style.
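One way a "similar photos" recommendation can work is to compare the sets of labels the detectors emit for each image, for instance with Jaccard similarity. This is a hypothetical sketch with invented photo IDs and labels, not the ADDI prototype’s actual ranking logic:

```python
def jaccard(a, b):
    """Overlap between two sets of detected labels (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical annotations, as an object detector might emit them.
annotations = {
    "photo_001": {"person", "car", "building"},
    "photo_002": {"car", "building", "sky"},
    "photo_003": {"horse", "fence"},
}

def recommend(photo_id, k=2):
    """Rank the other photos by how many detected labels they share."""
    others = [(pid, jaccard(annotations[photo_id], labels))
              for pid, labels in annotations.items() if pid != photo_id]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return [pid for pid, score in others[:k] if score > 0]

print(recommend("photo_001"))  # → ['photo_002']
```

Recommending by photographic style, which the prototype also does, requires comparing image features rather than label sets, but the ranking idea is the same.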
Using these methodologies, they processed about 300,000 photographs from five 20th century collections within the Prints and Photographs division: the Detroit Publishing Company, Farm Security Administration/Office of War Information Black-and-White Negatives and Color Photographs, George Grantham Bain, Harris and Ewing, and the National Photo Company.
One interesting thing they learned is that region segmentation, which partitions an entire photo into labeled regions rather than detecting individual objects, works incredibly well for annotating features such as buildings, the sky, or the road in historical images.
The code for the ADDI visualizer prototype can be found on GitHub.
Connecting with Curatorial Expertise
The researchers’ decision-making was heavily informed by the generosity, experience, and subject matter expertise of Library of Congress staff. Curation decisions, such as how items were digitized or differences in metadata across Library systems, can impact computational outcomes, and insight into how data are created can inform how researchers need to modify their algorithms to make effective use of the materials and the data representing them. Conversations with staff in diverse roles were crucial in helping the researchers understand how image and text data are structured and how to access the item information their algorithms needed.
Mullen said that the varied conversations he had with staff—from helping him understand the JSON output from the loc.gov API to understanding how the Library defines a collection—were helpful in understanding the dimensions of materials that were available to use. Mullen also indicated that curator input was crucial in helping him prioritize the collections that would provide the most text data for him to work with.
Digitization decisions can impact the effectiveness of computer vision algorithms. While designing her work, Tilton noticed variations in the collections she sampled, so she sought information about digitization techniques and consulted with Library staff to learn more. How a photo is digitized, such as with a frame or without one, can affect how successfully computer vision is applied to it. For example, the object detection algorithm could detect the frame as an object, or determine that two images are similar simply because both are oval. In these instances, Tilton and Arnold had to crop out the image frames. Many of their results were shaped by how the items in the collection were digitized.
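The frame-cropping step can be sketched simply: find the rows and columns that contain non-frame pixels and keep only that bounding box. This toy version works on a grid of grayscale values and assumes a perfectly uniform frame; real scans (and oval frames like the ones Tilton encountered) need more tolerant image-library code:

```python
def crop_uniform_border(image, border_value=0):
    """Trim edge rows and columns whose pixels all equal border_value.

    `image` is a list of rows of grayscale values; a real pipeline would do
    this on arrays (e.g. with NumPy or OpenCV), but the idea is the same.
    """
    if not image or not image[0]:
        return []
    rows = [i for i, row in enumerate(image)
            if any(p != border_value for p in row)]
    cols = [j for j in range(len(image[0]))
            if any(row[j] != border_value for row in image)]
    if not rows or not cols:
        return []
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]

framed = [
    [0, 0, 0, 0],
    [0, 5, 7, 0],
    [0, 6, 8, 0],
    [0, 0, 0, 0],
]
print(crop_uniform_border(framed))  # → [[5, 7], [6, 8]]
```

Preprocessing like this is exactly the kind of step that digitization choices force on researchers: the algorithm itself is simple, but knowing it is needed requires understanding how the scans were made.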
What’s Next for the Researchers?
In the coming months, Yelton plans to spend more time exploring and interpreting the results of her visualization to learn even more about the Reconstruction Era. She would also like to add some improvements to her visualization interface, including:
- Developing a guided tour to orient users to entry-points in the interface
- Processing each newspaper as distinct articles
- Performing additional user experience testing
Yelton sees several opportunities ahead for this work. Familiarizing the underlying neural net with common OCR errors could improve the outcomes of search results. Implementing search and suggestion functionality could enable users to uncover even more relationships between documents.
Going forward, Mullen will take steps to further analyze and interpret his results. He will continue to refine the model developed and plans to share more about his process on his website, America’s Public Bible. He will also explore the possibilities of applying language detection to text-based materials.
As part of ADDI, Tilton has prepared suggested practices for aggregating photographic data from the Library’s collections, in the hopes that others will find the information useful. She also hopes that her work encourages conversations about the role of digitization in the future of image analysis.
Looking to the Future
The researchers also hope that others will be able to build upon their work. Mullen and Tilton both plan to continue building the software and documentation they created during their work with CCHC. They want to ensure that others, from beginners to advanced computational researchers, can make use of their code and develop their own results.
When asked about how the researchers would advise others in undertaking computational research in an ethical manner, Yelton said that it was crucial to work in communities with “those who are most affected” by the work. Tilton suggested staying connected to emerging practice and scholarship, such as by making a reading list and using it to guide the work. Mullen emphasized that computational researchers must find the people who created the data to know how to best approach it.
What’s Next for the CCHC Initiative?
Over the past year, the CCHC team has taken a values-driven approach to this work: we have developed use cases, illustrated possible and current workflows for serving data to researchers, tested approaches for transferring data to researchers via a cloud storage system, and developed new understandings of the challenges computational researchers may face.
Learning from the researchers and Library colleagues has also provided the team with a greater understanding of the digital access afforded by current Library systems. This body of evidence will help the CCHC team define recommendations on how the Library could support future research.
Now that the researchers’ time with CCHC has ended, the team will reach out to our peers and potential users to gather further information about the challenges of large-scale collections access and support. We will also undertake a round of interviews with experts, members of the CCHC Advisory Board, and other cultural heritage organizations to gain specific insight and deepen our understanding of the requirements and conditions for enabling computational use of collections in the cloud – whether for research or creative use.
The team has some exciting next steps to explore before the initiative formally concludes in December 2022. We look forward to sharing more of our findings very soon. Stay tuned!