Inside, Inside Baseball: A Look at the Construction of the Dataset Featuring the Smithsonian’s National Museum of African American History and Culture and the Library of Congress Digital Collections

This is a guest blog post by visiting scholar archivist Julia Hickey, who is on a professional development assignment from the Defense Media Activity to the Library of Congress Labs team. Julia has been helping us prepare for and build out a visualization of collection data for our Inside Baseball event. This post was also edited by Eileen Jakeway and Courtney Johnson, 2018 Jr. Fellows with the Labs team.

Julia Hickey

After weeks of preparations and four days of fast-pitched ideation and creation, this Friday LC Labs will unveil the efforts of “Inside Baseball” – a collaboration between the Library of Congress, the Smithsonian National Museum of African American History and Culture, and JSTOR Labs. Joining the Baseball Americana batting lineup, this week of flash-building and design-thinking will debut new visualizations and prototypes to bring baseball-related digital collections to center field!

In preparation for this event, Julia Hickey, a visiting scholar with the LC Labs team, worked to prepare a data set including not only items from the Library’s collection but also items held by the National Museum of African American History and Culture (NMAAHC). Given the variations in terms, cataloging methods and subject heading names (to give only a few examples), this task entailed a great deal of effort to map the two sets along the data points they had in common.

This process was also an important first step in preparing this data for use by team members of JSTOR and LC Labs, who used it as the foundation for the development of tools being showcased at Friday’s event. The following blog post is a chance for Julia to describe this work in more detail, and show our readers how thinking of collections as data might lead to more effective collaboration between cultural heritage institutions, non-profits and museum curators!  


Standardizing Each Dataset

The first task in creating a unified dataset pulling items from both NMAAHC and the LC was to design a metadata map, or crosswalk, to serve as a key, or legend, to the final dataset incorporating information from both collections. This document arranges metadata fields from the various schemas, or cataloging systems, to ensure their data values fall under common headings and to show discrepancies in cases when they do not.

Building this metadata map began with the assessment of all the Library’s baseball-related digital collections, spanning divisions such as photographs, manuscripts and papers, all of which are subject to different cataloging techniques depending on their medium. These practices evolved over time leaving different patterns of information even within one medium and across the overall baseball digital collection items. Understanding the evolution of this data was critical given its standing as the first schema placed into the crosswalk, or metadata map.

The second schema to be entered into the crosswalk was the data acquired from the National Museum of African American History and Culture. Standardizing this data also depended on understanding cataloging processes, but from a museum perspective. Knowing the precise values and where to find them in the exported data fields was important in order to plan a consistent approach and final result. Experimenting with filtering helped reveal the data patterns and ensure proper alignment during the automated mapping. Edge cases were handled manually and placed into the corresponding unified fields.

[Image: crosswalk map]

Merging Datasets

Now that each dataset had been standardized and cleaned, the process of merging the two could begin. To “match” NMAAHC’s metadata to the LC fields, it was essential to examine the information stored within these fields more closely, to ensure the accuracy of the crosswalk and the final dataset. An idea suggested early in the intellectual construction of the crosswalk was to include an additional metadata schema, Dublin Core, as another point of reference within the metadata map. Dublin Core is a simplified schema designed to translate readily to nearly every other metadata schema, with an emphasis on common field names. Incorporating it immediately helped unify the two institutions’ metadata, because Dublin Core’s simple, relatable approach boils complexity down into user-friendly field names. For example, field “260$c” from MARC, used by the Library, is “Date” in Dublin Core, while “Author, Artist, Inventor or Publisher” is easily mapped to “Creator” to capture all types of artistic or scientific origin. “Creator/Publisher” became one of our common field names for the final dataset. Adding “Publisher” to the field name was essential, as a majority of the collection items are printed publications such as baseball cards, books and papers.
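To make the idea of a crosswalk concrete, here is a minimal Python sketch of how a field map like this could be applied to a single record. It is an illustration only, not the tooling used for this project: the function name, the "extent" field and the sample record are invented, and only the two mappings named above come from the actual crosswalk.

```python
# Hypothetical crosswalk: source field name -> common field name.
# Only these two mappings are taken from the post; a real crosswalk
# would list every field from both institutions' schemas.
CROSSWALK = {
    "260$c": "Date",
    "Author, Artist, Inventor or Publisher": "Creator/Publisher",
}


def apply_crosswalk(record, crosswalk=CROSSWALK):
    """Rename a record's fields to the common field names.

    Returns two dicts: the mapped fields, and any fields with no entry
    in the crosswalk, which are set aside for manual review.
    """
    mapped, unmapped = {}, {}
    for field, value in record.items():
        common_name = crosswalk.get(field)
        if common_name is None:
            unmapped[field] = value
        else:
            mapped[common_name] = value
    return mapped, unmapped


# Invented sample record, for illustration only.
sample = {"260$c": "1909", "extent": "1 card"}
mapped, unmapped = apply_crosswalk(sample)
print(mapped)    # fields renamed to the common headings
print(unmapped)  # fields that still need a human decision
```

Setting unmapped fields aside rather than dropping them mirrors the way marginal cases were handled by hand rather than forced into the automated mapping.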

With the help of Dublin Core, the crosswalk was approved by both institutions. The work to format the merged dataset using common terms could now begin, and the earlier familiarization with each institution’s data paid off, leading to easier decisions about how to normalize the data. For instance, both institutions had date fields, but the formats varied according to how much information was available about a particular collection item. Some dates within LC’s collection were MM/DD/YYYY, while others had only a “circa” date or a date range. The museum’s format was more consistent, but there was still no common format once the data was combined. Adding a field titled “Era” offered uniformity across and within each institution’s now-combined collection items. The work to normalize this data was as simple as writing an Excel formula to convert the year found in the date field into the decade to which it belongs, and to place this value into a distinct field, or Excel column, titled “Era.”
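The actual conversion was done with an Excel formula, which is not reproduced here. For readers who prefer code, the same idea can be sketched in Python. The function name, the year range accepted by the pattern, the “1940s”-style label and the choice to use the first year of a date range are all assumptions made for this example.

```python
import re

# Matches a standalone four-digit year from 1600 through 2099.
# The accepted range is an assumption made for this sketch.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[6-9]\d{2}|20\d{2})(?!\d)")


def era_from_date(date_text):
    """Return the decade of the first year found in date_text.

    Handles full dates, "circa" dates and date ranges by looking only
    for a four-digit year. Returns None when no year is present.
    """
    match = YEAR_PATTERN.search(date_text or "")
    if match is None:
        return None
    year = int(match.group(1))
    decade = year // 10 * 10  # e.g. 1947 -> 1940
    return f"{decade}s"


# Invented sample values covering the formats described above.
for value in ["04/15/1947", "ca. 1910", "1887-1890", "undated"]:
    print(value, "->", era_from_date(value))
```

For a date range, this sketch files the item under the decade of its earliest year; whether a range that spans two decades should be handled differently is a cataloging decision rather than a coding one.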

Uniting the collections into a common field titled “Object Type” was more complicated than an automated Excel formula. In the final dataset, common values were decided upon to unite the two collections within this one field. For example, the Library uses the value “notated music” to describe the original format of what is more commonly known as sheet music. “Notated music” is the proper classification within the library catalog, but the term is not commonly recognized outside that context. Given the precision involved, and the occasional need to consult more than the values in the combined object type lists, going through each collection item individually and assigning it one of the values from the standard picklist gave the best assurance of consistent results, without spending time developing a formula or script to make the conversion in bulk.
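Although the Object Type values were assigned by hand, a short script can still double-check the result afterward. The sketch below is a hypothetical example of such a check, not part of the project’s workflow: the picklist values other than “sheet music” are invented, as are the function name and the sample records.

```python
# Hypothetical picklist. Only "sheet music" (as the common term for
# "notated music") comes from the post; the other values are invented.
OBJECT_TYPE_PICKLIST = {"sheet music", "photograph", "baseball card"}


def find_invalid_object_types(records, picklist=OBJECT_TYPE_PICKLIST):
    """Return (index, value) pairs for records whose Object Type is
    missing or is not one of the picklist values.

    The comparison ignores case and surrounding whitespace.
    """
    problems = []
    for index, record in enumerate(records):
        raw_value = record.get("Object Type")
        normalized = (raw_value or "").strip().lower()
        if normalized not in picklist:
            problems.append((index, raw_value))
    return problems


# Invented sample records, for illustration only.
records = [
    {"Object Type": "Sheet Music"},
    {"Object Type": "notated music"},  # catalog term not yet converted
    {},                                # Object Type never assigned
]
print(find_invalid_object_types(records))
```

A check like this only flags values for a person to review; the judgment about which picklist value fits each item stays with the cataloger.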

Another field that caused complications when merging the two datasets was “Subjects,” which both institutions use to record the topics or subject matter presented within a collection item. Creating a uniform subject list was deemed too intensive given the deadline for the flash-build week. The complication was not technical but intellectual and ethical. NMAAHC has a distinct mission to speak accurately and authoritatively about African American history and culture within and among its collection items. The Library recognized the under-representation of African American subjects in its own cataloging, something unfortunately common across many cultural heritage institutions for all minorities in history. Rather than standardizing the subjects from each institution into one list, the idea of doing so, relying on the authority of NMAAHC, was put into a “parking lot” for future projects. The subjects were kept as provided in the institutions’ datasets. Where the Museum’s subjects could augment and even fill in gaps in the Library’s collections, those subjects have been compiled into a list to recommend to the Library’s cataloging division.

Final Dataset and Further Documentation

As described above, eight weeks of preparatory work resulted in a final, unified dataset accompanied by a detailed metadata map showing the process of cleaning, standardizing and merging the two datasets. This additional documentation is important for the teams working with the data to develop a final digital product. By diagramming the decision points discussed above and showing a data model of the mapping, it makes the programmers working with this information aware of the many curatorial, ethical and cultural decisions behind the organization of the data. Diagrams of the data mapping thus serve as helpful secondary sources of information for database administrators and, in this instance, help the developers understand how each metadata field relates to its neighboring and aligned fields. This documentation can also be used for future partnerships with the Smithsonian, and can serve as a template for the Library’s future collaborative efforts in general.


Now that you’ve read about how the dataset was prepared, tune in this Friday, July 13 to watch baseball collections step up to the digital plate! The final “Inside Baseball” showcase will take place from 9:30 a.m. to 3 p.m. in the Library of Congress’ Coolidge Auditorium, as our teams debut new prototypes and tools, discuss the warm-ups leading to the reveal, and host a panel conversation featuring ESPN’s Clinton Yates, baseball historian Rob Ruck, and mathematician Jordan Ellenberg. Can’t attend in person? Don’t worry! We will also be live streaming the entire event here.