WWI Linked Open Data: An Interview with Thea Lindquist

The following is a guest post from Jane Mandelbaum, co-chair of the National Digital Stewardship Alliance Innovation Working Group and IT Project Manager at the Library of Congress.

Thea Lindquist, associate professor and history librarian at the University of Colorado Boulder

In this installment of the NDSA Innovation Working Group’s ongoing series of innovation interviews, I talk with Thea Lindquist, an associate professor and history librarian at the University of Colorado Boulder. Through a project to digitize and enhance access to a collection of World War I materials, she became interested in the potential to increase interoperability and discovery across digital historical collections. In 2011 she spent the fall term at Aalto University in Finland working with the Semantic Computing Research Group on a World War I linked open data project, which is ongoing. The work was submitted to the 2013 Linked Open Data in Libraries, Archives, and Museums Summit as an entry to a challenge. The entry video is available online, as is a demonstration of the project.

She is particularly interested in the geospatial-temporal aspect of linking data, springing in large part from her previous work as a geospatial information librarian at the University of Michigan.

Jane: Can you tell us how you got interested in the project you worked on as a Fulbright scholar?

Thea: It started with a related project a colleague and I were working on to create a user-centered digital tool for work with online primary sources. As a part of this work, we conducted a user needs assessment with humanities students and faculty at CU to pinpoint what would make it easier and more interesting for them to engage with these sources. The big takeaways were not entirely unexpected: improved findability of documents – and the data within them – on specific people, places, topics and timeframes as well as more historical and archival context for the documents and data. In brainstorming ideas for the tool, I learned about linked data and how it could help ameliorate many of the problems associated with working with online primary sources. A big one is the findability of the sources – users often find the metadata inadequate to expose individual sources and especially sections within them with the desired granularity. Also, since similar concepts are expressed in different ways across texts, keyword searching is haphazard. Online primary sources are even more susceptible to decontextualization, since keyword searching encourages users to look for snippets of a document in which a given term is mentioned and then skip forward to the next occurrence, rather than reading the document in its entirety. Search engines and collections of links to online sources can contribute to this problem by disaggregating individual documents from their archive of origin. Another issue is lack of context, which is necessary for many users, especially students and non-experts, to engage with the substance of the material. This context can include displaying the relationships between individual documents as well as resources that help explain how each document, and the information within it, fits into its historical context.
Even with relevant sources and adequate context, users may struggle with further challenges inherent to primary-source research: foreign languages, document bias, historical usage, orthography, grammar, paleography/typography, etc. Once the utility of linked data for the purpose of addressing these problems – at least in an ideal implementation – was apparent, I needed time to learn more and find partners with specialized expertise on the technical side, so I decided to write up a Fulbright project to do just that.

Jane: How did you start working with the Semantic Computing Research Group at Aalto University in Helsinki?

Thea: At the time I wrote the grant, SeCo was one of the few groups that had published research on a Linked Data approach with digital cultural heritage materials, and particularly on digitized primary sources. I was fortunate that the director, Eero Hyvönen, and his group were as interested in testing an innovative approach as I was, namely in going beyond the metadata to deep link in online primary sources and demonstrate to what extent we could improve access to and context in the sources in CU’s World War I Collection Online, testing both manual and automated methods of semantic annotation. SeCo developed several of the tools we have been using in this process, particularly the SAHA browser-based semantic annotator.

Jane: How do you describe to people what semantic computing might do for them?

Thea: Usually I say that it associates related concepts, increases findability, context and interoperability, enables semantically rich services (like faceted searching, content recommendations, and visualizations) and allows re-use, re-mixing and re-presenting of data. If they look puzzled, I start by comparing the current web of documents to the web of data. When you search for a term on the web of documents, the computer looks for the string of characters you entered, and it has no idea what the meaning associated with those characters is. When it finds matches, it returns the documents in which they are found, and it is up to you to slog through those and figure out if any of the matches are indeed relevant. If you look for “buck”, you could get documents about a male, antlered animal, a dollar, throwing (a rider) by bucking, giving someone a ride on your bike (this usage may be limited to Minnesota)… you get the picture. On the web of data, supported by ontological structures and intelligent applications, the computer can understand the word “buck” might have different meanings and what those might be, and it will ask you “are you interested in the monetary unit?” (among other things). If you say yes, it will direct you to the relevant data residing within documents rather than the entire document, whether the character string says “buck”, “dollar” or “single”.
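Editor's note: the contrast between matching character strings and matching concepts can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how WW1LOD or any semantic search engine is built: the concept identifiers, label sets and sample passages are all invented for the example.

```python
# Hypothetical mini "ontology": each invented concept ID maps to the
# labels (character strings) that can express it.
CONCEPTS = {
    "concept:deer_male": {"buck"},
    "concept:us_dollar": {"buck", "dollar", "single"},
}

# Hypothetical passages, each annotated with the concept it is about.
DOCUMENTS = [
    {"text": "A buck was seen at the edge of the forest.", "concept": "concept:deer_male"},
    {"text": "The pamphlet cost one dollar.", "concept": "concept:us_dollar"},
    {"text": "He paid with a single.", "concept": "concept:us_dollar"},
]


def string_search(term):
    """Web-of-documents style: return texts containing the character string."""
    return [d["text"] for d in DOCUMENTS if term.lower() in d["text"].lower()]


def candidate_concepts(term):
    """Return the concepts a term might refer to, so the user can pick one."""
    return sorted(c for c, labels in CONCEPTS.items() if term.lower() in labels)


def concept_search(concept_id):
    """Web-of-data style: return texts annotated with the chosen concept."""
    return [d["text"] for d in DOCUMENTS if d["concept"] == concept_id]
```

In this toy data, `string_search("buck")` finds only the passage about the animal. `candidate_concepts("buck")` offers both meanings, and choosing the monetary one with `concept_search("concept:us_dollar")` returns the passages that say "dollar" and "single", neither of which contains the string "buck".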

In the historical context, it can help users find information in a variety of languages, for example, about places with alternate names or whose spellings have changed over time (Bratislava/Prešporok/Pressburg [formerly Preßburg]/Pozsony) and geographies that have merged with, split from and been subsumed by other entities with which they are associated (Bohemia/Czechoslovakia/Czech Republic). It also allows searches across all Linked Data and surfaces it to the top level where users are searching. From the perspective of digitized cultural heritage collections, which are often hived off in databases under institutional web sites, this is hugely useful. There are also some good resources out there to point people to for examples, like the sig.ma semantic search engine, Europeana’s “Linked Open Data – what is it?” video, and SeCo’s CultureSampo semantic portal.
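Editor's note: the place-name case can be sketched the same way, by resolving every name form to a single identifier. The identifier and data structure below are assumptions made for illustration; a real project would draw on a published gazetteer or ontology rather than a hand-written dictionary.

```python
# Hypothetical place record: one invented identifier, with the name forms
# mentioned in the interview attached to it.
PLACES = {
    "place:bratislava": ["Bratislava", "Prešporok", "Pressburg", "Preßburg", "Pozsony"],
}

# Index every name form under a case-insensitive key. str.casefold() also
# maps "ß" to "ss", so "Preßburg" and "Pressburg" share one key.
NAME_INDEX = {
    name.casefold(): place_id
    for place_id, names in PLACES.items()
    for name in names
}


def resolve_place(name):
    """Return the place identifier for any known name form, or None."""
    return NAME_INDEX.get(name.casefold())
```

A search for "Pozsony" and a search for "Pressburg" then land on the same record, so documents annotated with that identifier are found whichever name they use.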

A screen shot of the interface from the WW1 Linked Open Data Demo

Jane: How did you get interested in using the CU Boulder collection of World War I primary materials?

Thea: The collection was a surprise discovery while I was doing my first review of the history collection for offsite storage. In one of the many ranges of compact shelving in the basement, I came across 56 bound volumes with the title “World War pamphlets”. The material in them was amazing, and the only point of access was a skeletal record in the catalog with the same title and one subject heading, “World War, 1914-1918”, i.e., next to no access. On top of that, the paper was terrible, and the materials were deteriorating rapidly. I realize now that one of the reasons they had not yet turned to corn flakes was that they had resided so long undisturbed in the basement. There is a lot of interest in World War I on the CU campus. For preservation and access purposes – not just for history classes at CU but for the wider world – I proposed the collection be digitized and made keyword searchable. We funded the digitization project through several grants. The WWI Collection Online is currently the CU Libraries’ largest digital collection.

Jane: How do you think institutions with primary materials collections (like the WWI collection) can take advantage of linked data to improve the access and use of their collections?

Thea: Institutions can start with the low-hanging fruit – their metadata. There are low-barrier tools available now that will allow them to make their digital collections more discoverable using linked data principles, like Viewshare. As you know, Viewshare allows institutions to easily generate and customize visualizations like timelines, interactive maps, and tag clouds – things that we did the hard way not too long ago using a variety of tools! Users really appreciate having a variety of ways to explore the content, and institutions don’t have to have a programmer on staff to do it.

Jane: Where do you see the intersection of historians and librarians in working with digital collections of primary materials?

Thea: Historians bring the specialized knowledge necessary in their areas of expertise to projects drawing on digital collections as well as ideas about how they and their students might best use these collections. Librarians are often the ones who digitize, organize, and make collections of value to historians accessible now and in the long term. Subject specialists, particularly ones who are technology fluent, understand the needs of their aggregate user group (broad, as compared to the historians’ deep) and may serve as the common point of contact in multi-disciplinary groups working on projects to make these collections more accessible. Someone who is both a librarian and an historian might be able to take things a bit further in each area than they might have otherwise, but the input of experts – in the case of WW1LOD, WWI historians specializing in Belgium and France, metadata specialists, digital initiatives librarians, and of course computer scientists – is absolutely critical.

Jane: What do you think users find the most appealing about digital collections of primary materials?

Thea: Having the look of the original documents paired with the power of discovering and viewing the content in interactive ways, from keyword searching across a large corpus to visualizations of selected data points, at any time of day and from anywhere they have an internet connection.

Jane: How do you think visualization tools like maps and timelines benefit from linked data implementations?

Thea: In much the same way that all applications do. Linked Data fosters interoperability and the representation of instances – people, places, events, etc. – in different ways, e.g., by linking alternate name forms that were valid during certain time frames. It allows the applications consuming it to query and connect to Linked Data on the web and make inferences by drawing upon ontologies underlying it. A map visualization could then display boundaries for the Roman Empire in the 2nd century AD and multilingual mapping of places (Wien/Vienna/Vienne/Vindobona), that is, if the necessary elements are there in terms of data and structure. One of the greater challenges in this scenario is the availability of historical boundaries so an accurate map can be generated on which to display the point data, and going that far back in time, current point data is also likely to be incomplete and less accurate. Another is access to geospatial ontologies with relevant historical coverage. I believe this will come, but it will take time and resources. A timeline application fueled by Linked Data could give a more nuanced display of events because alternate timeframes can be shown, again given the necessary elements. For instance, each of the major belligerents that fought on the Western Front in WWI (UK, France, Belgium, US and Germany) produced an official list giving the names and dates of engagements in which their troops took part, with inevitable discrepancies between them. For the Germans the Autumn Battle in Champagne ended on November 3, 1915; but for the French and Belgians, the 2nd Battle of Champagne ended on November 6, 1915. The timeline could show and compare these differing viewpoints.
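Editor's note: one way to picture a timeline that holds several viewpoints is to store one event with a separate name and end date per official source. The sketch below uses only the names and end dates given in the answer above; the event identifier, field names and helper functions are invented for illustration and do not reflect the WW1LOD data model.

```python
from datetime import date

# One event, several official viewpoints. Names and end dates come from the
# interview; the identifier and field names are assumptions for this sketch.
EVENT = {
    "id": "event:champagne_autumn_1915",
    "views": [
        {"source": "Germany", "name": "Autumn Battle in Champagne", "end": date(1915, 11, 3)},
        {"source": "France", "name": "2nd Battle of Champagne", "end": date(1915, 11, 6)},
        {"source": "Belgium", "name": "2nd Battle of Champagne", "end": date(1915, 11, 6)},
    ],
}


def end_date_spread(event):
    """Return the number of days between the earliest and latest end dates."""
    ends = [view["end"] for view in event["views"]]
    return (max(ends) - min(ends)).days


def timeline_rows(event):
    """Return one display row per viewpoint, ordered by end date, then source."""
    views = sorted(event["views"], key=lambda v: (v["end"], v["source"]))
    return [f'{v["source"]}: {v["name"]} (ended {v["end"].isoformat()})' for v in views]
```

Because each viewpoint keeps its own name and date, a timeline can draw them side by side and flag the three-day discrepancy instead of silently picking one version.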

Jane: You have described building a specialized vocabulary for describing the civilian experience in one country, Belgium, during the war and building semantic frameworks for military events. How did these efforts get started and how can they be used?

Thea: We started with the civilian experience in occupied Belgium in WWI since the documents were richer there, but the vocabulary has since been extended to cover occupied France as well. This topic was selected for more intensive semantic linking not only because it was well-represented in the WWI Collection Online, but also because the impact of “total war” on civilian populations is an area of current scholarly interest. Most of the publications in the collection falling into this category deal with the hardships civilians suffered during the German invasion and occupation of Belgium and northern France, particularly atrocity incidents such as killings and worker deportations and the impact of military rule on day-to-day life. The general, event-based framework for WWI was planned from the outset as a contribution that could be of value to many cultural heritage institutions seeking to expose their WWI-related digital collections, particularly in the run-up to the centenary. It includes key military, political and social events, the basis of which was timeline data shared by the Imperial War Museum’s First World War Centenary Partnership Programme. It is meant to be shared widely, thus providing the “semantic glue” that binds separate datasets relating to WWI together and allows searching and browsing in the broader corpus. The specialized vocabulary, event-based framework and other structures we have created for this project will be made freely available for reuse via a data dump and SPARQL endpoint.

Jane: You worked previously as a geospatial information librarian. Can you talk about how that is different from being a maps librarian? Can you describe what you learned about preserving geospatial information?

Thea: The job title took in the fact that I not only developed and helped users access analog resources but also digital resources. Much of my job was helping users in the humanities and social sciences find geographically referenced information and then use GIS to analyze and visualize it in ways that were meaningful to their research. The data really didn’t become a map until it had reached the visualization stage. It was a lot of fun to help one user find the town their grandparents came from in present-day Poland using historical gazetteers and then turn around and help another mash up data on how racial and social factors relate to unemployment in Flint, Michigan. At the time – over ten years ago – there weren’t many conversations about how we would preserve and provide longer-term access to our digital assets other than backing them up on hard drives and servers. The print maps were another story. They were housed flat in special map cabinets in an environmentally controlled area and received conservation and preservation treatment from a dedicated lab.