The following is a guest post by Laura Wrubel, software development librarian with George Washington University Libraries, who has joined the Library of Congress Labs team during her research leave.
The Library of Congress website has an API (“application programming interface”) which delivers the content for each web page. What’s kind of exciting is that in addition to providing HTML for the website, all of that data, including the digitized collections, is available publicly in JSON format, a structured format that you can parse with code or transform into other formats. With an API, you can do things like:
- build a dataset for analysis, visualization, or mapping
- dynamically include content from a website in your own website
- query for data to feed a Twitter bot
This opens up the possibility for a person to write code that sends queries to the API in the form of URLs or “requests,” just like the ones your browser makes. The API returns a “response” in the form of structured data, which a person can parse with code. Of course, if there were already a dataset available to download, that would be ideal. David Brunton explains how bulk data is particularly useful in his talk “Using Data from Historical Newspapers.” Check out LC for Robots for a growing list of bulk data currently available for download.
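The request-and-response cycle described above can be sketched in Python using only the standard library. This is a minimal illustration, not the Library's own code: the `fo=json` and `c` query parameters, the `baseball-cards` collection path, and the `results` and `title` keys are assumptions about how the loc.gov JSON API is laid out, and since the API is a work in progress they may change. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.loc.gov"


def build_query_url(path, **params):
    """Build a loc.gov URL that asks for JSON rather than HTML.

    Assumption: adding fo=json to a page URL returns the JSON version.
    """
    query = {"fo": "json", **params}
    return f"{BASE_URL}/{path.strip('/')}/?{urlencode(query)}"


def fetch_json(url, timeout=30):
    """Send the request and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def item_titles(response_data):
    """Collect item titles from a parsed response.

    Assumption: items are listed under a top-level "results" key and
    each item may carry a "title" field.
    """
    return [item.get("title", "") for item in response_data.get("results", [])]
```

Calling `item_titles(fetch_json(build_query_url("collections/baseball-cards", c=5)))` would send a real request over the network, so it is worth pausing between calls when looping over many pages.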
I’ve spent some of my time while on research leave creating documentation for the loc.gov JSON API. It’s worth keeping in mind that the loc.gov JSON API is a work in progress and subject to change. But even though it’s unofficial, it can be a useful access point for researchers. I had a few aims in this documentation project: make more people aware of the API and the data available from it, remove some of the barriers to using it by providing examples of queries and code, and demonstrate some ways to use it for analysis. I approached this task keeping in mind a talk I heard at PyCon 2017, Daniele Procida’s “How documentation works, and how to make it work for your project” (also available as a blog post), which classifies documentation into four categories: reference, tutorials, how-to, and explanation. This framing can be useful in making sure your documentation achieves its purpose. The loc.gov JSON API documentation is reference documentation, and it points to Jupyter notebooks for Python tutorials and how-to code. If you have ideas about which additional “how-to” guides and tutorials would be useful, I’d be interested to hear them!
At the same time that I was digging into the API, I was working on some Jupyter notebooks with Python code for creating image datasets, for both internal and public use. I became intrigued by the possibilities of programmatic access to thumbnail images from the Library’s digitized collections. I’ve had color on my mind as an entry point to collections since I saw Chad Nelson’s DPLA Color Browse project at DPLAfest in 2015.
So as an experiment, I created Library of Congress Colors.
The app displays six color swatches, based on cluster analysis, from each of the images in selected collections. Most of the collections have thousands of images, so it’s striking to see the patterns that emerge as you scroll through the color swatches (see Baseball Cards, for example). It also reveals how characteristics of the images can affect programmatic analysis. For example, many of the digitized images in the Cartoons and Drawings collection include a color target, which was a standard practice when creating color transparencies. Those transparencies were later scanned for display online. While useful for assessing color accuracy, the presence of the target interferes with color analysis of the cartoon, so you’ll see colors from that target pop up in the color swatches for images in that collection. Similarly, mattes, frames, and other borders in the image can skew the analysis. As an example, click through the color bar below to see the colors in the original cartoon by F. Fallon in the Prints and Photographs Division.
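To give a rough idea of how cluster analysis can turn an image into a handful of swatches, here is a small sketch using k-means clustering on RGB values. The post only says “cluster analysis,” so the choice of k-means is an assumption, as are the function name and parameters; this is not the code behind Library of Congress Colors. It expects pixels as `(r, g, b)` tuples, which in practice would come from an imaging library reading a thumbnail.

```python
import random


def dominant_colors(pixels, k=6, iterations=10, seed=0):
    """Cluster (r, g, b) tuples with k-means; return hex colors, largest cluster first."""
    rng = random.Random(seed)
    unique = sorted(set(pixels))
    k = min(k, len(unique))  # cannot have more clusters than distinct colors
    centers = rng.sample(unique, k)
    clusters = [[] for _ in centers]

    for _ in range(iterations):
        # Assignment step: attach each pixel to its nearest center.
        clusters = [[] for _ in centers]
        for pixel in pixels:
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(pixel, centers[i])),
            )
            clusters[nearest].append(pixel)
        # Update step: move each center to the mean of its pixels.
        centers = [
            tuple(sum(channel) / len(cluster) for channel in zip(*cluster))
            if cluster
            else centers[i]
            for i, cluster in enumerate(clusters)
        ]

    # Order swatches by how many pixels fell into each cluster.
    order = sorted(range(len(centers)), key=lambda i: len(clusters[i]), reverse=True)
    return [
        "#{:02x}{:02x}{:02x}".format(*(round(value) for value in centers[i]))
        for i in order
    ]
```

Because every pixel counts equally, a color target or a wide matte contributes its own clusters, which is how those colors end up among the swatches. Cropping the borders before clustering would be one way to reduce that effect.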
This project was a fun way to visualize the collection while testing the API, and I’ve benefited from working with the National Digital Initiatives team as I developed the project. They and their colleagues have been a source of ideas for how to improve the visualization, connected me with people who understand the image formats, and provided LC Labs Amazon Web Services storage for making the underlying data sets downloadable by others. We’ve speculated about the patterns that emerge in the colors and have dozens more questions about the collections from exploring the results.
There’s something about color that is delightful and inspiring. Since I’ve put the app out there, I’ve heard ideas from people about using the colors to inspire embroidery, select paint colors, or think about color in design languages. I’ve also heard from people excited to see Python used to explore library collections and to view an example of using a public API. I, myself, am curious to see what people may find as they explore Library of Congress collections as data and use the loc.gov JSON API or one of the many other APIs to create their own data sets. What could LC Labs do to help with this? What would you like to see?