LC Labs is pleased to announce the release of three new sets of digitized materials packaged as data. Located on data.labs.loc.gov, the three data packages (digitized maps, books, and photographs) were created as part of the Computing Cultural Heritage in the Cloud (CCHC) initiative’s investigation of how the Library can use cloud-based technologies to enable bulk analysis of materials at scale. The data are hosted in an experimental cloud repository and combine curated datasets derived from Library digital collections with user-friendly technical guidance and descriptive information.
3 New Data Packages and How to Access Them
The three data packages include:
- over 90,000 full text book files from the Selected Digitized Books collection;
- over 30,000 images from the Stereograph Cards collection; and
- more than 5,000 maps from the Library’s Austria-Hungary cartographic resources.
To provide access to the new material, our team established the LC Labs Data Sandbox, comprising both a cloud hosting space (s3://data.labs.loc.gov) and a user-friendly interface on the LC Labs website (https://data.labs.loc.gov) depicted above. As you can see in the image below, each data package consists of a bundle of actual digital files as well as technical documentation about how the dataset was compiled and contextual documentation describing the digitization and collection history of the material at the Library. With this mix of both technical and humanistic information, these resources have great relevance for any students of the Digital Humanities, aspiring data visualization artists, or computational researchers seeking to ask questions across a large corpus.
While we in LC Labs have a history of making collections available as datasets, notably through our LC for Robots page, the new sandbox is unique in its ability to provide direct computational access to the materials it hosts. There are two main ways of accessing the data:
- Download the sample packs and supporting documentation directly from the data.labs.loc.gov website.
- Retrieve the entire corpus from the S3 backend (s3://data.labs.loc.gov) via command line interface, software development kit, or other computational method, or by using the corresponding dataset’s manifest.
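For readers taking the software development kit route, here is a minimal Python sketch of what listing the bucket's contents might look like. It rests on assumptions not confirmed in this post: that the boto3 library is installed and that the bucket permits anonymous (unsigned) reads. The helper names `split_s3_uri` and `iter_keys` are hypothetical, chosen for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def iter_keys(uri: str):
    """Yield object keys stored under an S3 URI.

    Assumption: boto3 is installed and the bucket allows anonymous reads.
    """
    # Imported lazily so split_s3_uri works without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, prefix = split_s3_uri(uri)
    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix may have no "Contents" entry at all.
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Example (requires network access):
#     for key in iter_keys("s3://data.labs.loc.gov"):
#         print(key)
```

Listing keys first, rather than downloading everything at once, lets you check the size and layout of a package before committing to a bulk transfer.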
How did we compile the data packages?
To develop the data packages, Chase Dooley, who was on detail from the Library of Congress Web Archives Team, and I collaborated with Library curatorial and technical staff to design processes for compiling, documenting and publishing digital collections materials as datasets. Notable steps of this workflow include repackaging digital collections materials as datasets for computational use; documenting the resulting datasets in keeping with cultural heritage and data science best practices; and making these datasets available to users via programmatic access pathways in keeping with the Library’s technical, security, and policy requirements. We’ve included a brief summary of each dataset’s creation below, pulling directly from the accompanying README files, which you can find on each data package’s landing page.
Stereograph Cards Data Package
This dataset was created using the LOC JSON/YAML API and comprises a scoped portion of the stereographs rather than every item in the collection. Subject matter experts were consulted in the creation of a JSON API query (https://www.loc.gov/collections/stereograph-cards?dates=1800/1924&fa=access-restricted:false&q=no%20known%20restrictions&c=150&fo=json) to produce rights-free stereographs from the 1850s through 1924, as a subset of what was available online in the collection on loc.gov in August 2022. This original query returned 44,694 results. The final dataset, after filtering out duplicates and items that had no images available, comprises 39,526 items.
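The filtering step, which removed 5,168 of the 44,694 results, can be sketched roughly as follows. This is an illustration, not the script the team used: it assumes each API result is a dictionary with an `id` field and an `image_url` list, and the function name `filter_results` and the sample records are hypothetical.

```python
def filter_results(results):
    """Drop items without images and duplicate items, preserving order.

    Assumption: each result is a dict with an "id" and an "image_url" list.
    """
    seen_ids = set()
    kept = []
    for item in results:
        if not item.get("image_url"):
            continue  # no image available for this item
        item_id = item.get("id")
        if item_id in seen_ids:
            continue  # duplicate of an item already kept
        seen_ids.add(item_id)
        kept.append(item)
    return kept


# Hypothetical sample records, for illustration only.
sample = [
    {"id": "item-a", "image_url": ["a.jpg"]},
    {"id": "item-b", "image_url": []},         # dropped: no image
    {"id": "item-a", "image_url": ["a.jpg"]},  # dropped: duplicate
    {"id": "item-c", "image_url": ["c.jpg"]},
]

if __name__ == "__main__":
    print([item["id"] for item in filter_results(sample)])
```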
Austro-Hungarian Map Set Data Package
This experimental dataset was produced in 2015 under the GIS Research Fellows program’s Geographic Hot Spot Dynamic Indexing Project. It is one of many map sets that were digitized and georeferenced as part of that project. The general workflow for the project consisted of:
- Imaging the physical sheets using a commercial sheetfed scanner, to 300 ppi TIFF files (likely using image editing auto-settings on the scanner).
- Manually transcribing the sheet coordinates and translating them to the Greenwich-meridian system (from the Ferro-Island-meridian system used on the original sheets).
- Straightening the image and cropping the map collar. (At the start of the project this was accomplished manually in Photoshop and later using an automated script developed as part of the project. It is not known which method was used on this particular set.)
- Adding the coordinate information to the TIFF file to convert it to a GeoTIFF (using an application called Quad-G).
- Loading the GeoTIFF images into ArcMap and using them to generate a mosaic dataset footprint file, which was then exported in shapefile format to create the shapefile found in this dataset.
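To make the meridian translation step concrete, here is a small Python sketch of the arithmetic. It assumes the conventional placement of the Ferro meridian at 17°40′ west of Greenwich; the README does not state which offset the project applied, so treat the constant and the function names as illustrative.

```python
# Assumption: the conventional offset of the Ferro meridian, 17 degrees
# 40 minutes west of Greenwich. The project may have used another value.
FERRO_WEST_OF_GREENWICH_DEG = 17 + 40 / 60


def dms_to_decimal(degrees, minutes=0, seconds=0.0):
    """Convert degrees, minutes, seconds to decimal degrees."""
    return degrees + minutes / 60 + seconds / 3600


def ferro_to_greenwich(degrees, minutes=0, seconds=0.0):
    """Convert a longitude east of Ferro to decimal degrees east of Greenwich."""
    return dms_to_decimal(degrees, minutes, seconds) - FERRO_WEST_OF_GREENWICH_DEG


if __name__ == "__main__":
    # A sheet edge printed as 34 degrees east of Ferro.
    print(round(ferro_to_greenwich(34, 0), 4))
```

Because the Ferro meridian lies west of Greenwich, a longitude counted eastward from Ferro shrinks by the offset when re-expressed relative to Greenwich.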
Selected Digitized Books Data Package
This dataset was created using the LOC JSON/YAML API to fetch the metadata and an internal workflow processing and data management application to pull the associated full text from an LCCN. The metadata comprises all of the selected digitized books (as of 2022-08-26) that had a date associated with them. The LOC API returns a maximum of 100,000 objects per query. As of 2022-08-26, there were over 118,000 selected digitized books, but only 90,414 had a date associated with them. To work around the API’s maximum result limitation, only items with a date were gathered for this initial release. The additional items will be added in future updates to the dataset. The two queries that were used to gather this initial data were:
- Dates before 1900: https://www.loc.gov/collections/selected-digitized-books/?c=150&dates=1000/1899&fa=access-restricted:false&fo=json
- Dates from 1900 on: https://www.loc.gov/collections/selected-digitized-books/?c=150&dates=1900/2099&fa=access-restricted:false&fo=json
Of the 90,414 items in the metadata, 1,894 do not currently have a full-text extract associated with them. This set will be updated when those extracts become available.
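The date-split approach used for the two queries can be sketched in Python as follows. The query parameters are copied from the URLs listed in this section; the function name `build_query_url`, the constant names, and the contiguity check are hypothetical additions for illustration.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.loc.gov/collections/selected-digitized-books/"

# The two date ranges used for the initial release.
DATE_RANGES = [(1000, 1899), (1900, 2099)]


def build_query_url(start_year, end_year, per_page=150):
    """Build a JSON query URL for one date range."""
    params = {
        "c": per_page,
        "dates": f"{start_year}/{end_year}",
        "fa": "access-restricted:false",
        "fo": "json",
    }
    # Keep "/" and ":" unescaped so the URL matches the form shown above.
    return BASE_URL + "?" + urlencode(params, safe="/:", quote_via=quote)


def ranges_are_contiguous(ranges):
    """Check that consecutive ranges neither overlap nor leave a gap year."""
    ordered = sorted(ranges)
    return all(
        ordered[i][1] + 1 == ordered[i + 1][0] for i in range(len(ordered) - 1)
    )


if __name__ == "__main__":
    for start, end in DATE_RANGES:
        print(build_query_url(start, end))
```

Splitting on a date boundary keeps each query under the 100,000-object ceiling, and checking that the ranges are contiguous guards against counting an item twice or skipping a year.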
Feedback from Expert Data Jammers
Before making these data packages public, we enlisted the help of seven cultural heritage data experts from all over the world to test various access pathways and give us feedback. In our next blog post, scheduled for later this week, we’ll share more about what they found. Subscribe to the Signal Blog to make sure the next issue goes straight to your inbox!
If you need assistance or would like to provide input on your experience in the meantime, feel free to reach out to our team at LC-Labs@loc.gov. We will continue to explore the potential of cloud-based experimentation with Library datasets and want to hear from you!