New Year, New You: A Digital Scholarship Guide (in seven parts!)

To get 2018 going in a positive digital direction, we are releasing a guide for working with digital resources. Every Wednesday for the next seven weeks a new part of the guide will be released on The Signal. The guide covers what digital archives and digital humanities are trying to achieve, how to create digital documents, metadata and text-encoding, digital content and citation management, data cleaning methods, an introduction to working in the command line, text and visual analysis tools and techniques, and a list of people, blogs, and digital scholarship labs to follow to learn more about the topic. If you need all of this information immediately, feel free to binge on the full guide, available now in PDF. (No spoilers!)

This project is part of a larger exploration the Labs team is facilitating to create a reference service for working with collections as data at the Library of Congress. It is also part of the long-running Junior Fellows Summer Internship Program. Last summer, one of the interns we hosted, Samantha Herron from Swarthmore College, created this guide. We think Sam did a great job pulling together an introduction to what digital scholarship is and what you need to know to start planning this type of project, and we are thrilled to feature her work on The Signal. These blog posts also serve as an example of the kind of projects our Jr. Fellows work on and hopefully will inspire some of you recent grads to apply to our 2018 openings. Applications for this summer are due January 26, 2018. A modest stipend is provided.

Now, on to the guide:

Photo by Carol Highsmith. William De Leftwich Dodge's mural Ambition. Library of Congress Thomas Jefferson Building, Washington, D.C.

Samantha Herron’s Digital Scholarship Resource Guide [part 1 of 7]

Why Digital Materials Matter

Digital archives are emerging and expanding at an increasing pace. The Library of Congress’ Digital Collections (and therefore their metadata) are always growing, always adding exciting new materials like photographs, newspapers, web archives, audio tracks, maps, and so on. Text, images, and physical objects formerly available only in person as tangible, holdable items can now be accessed online as plaintext, digital facsimiles, MARC files, .jpgs, .pdfs, hypertext, audio, etc. In addition to making these materials more accessible from all over the world, different digital formats enable exciting, computer-assisted scholarship, projects, and art.

For example:

This is Jane Austen’s Pride and Prejudice.

This is Jane Austen’s Pride and Prejudice.

This is Jane Austen’s Pride and Prejudice.

This is Jane Austen’s Pride and Prejudice.

So is this.

Though all of the above links (a modern-day paperback, a digital facsimile, a plaintext copy, an audio recording, and the catalog record for a bound copy of the second edition of the book) refer to the same text, Jane Austen’s Pride and Prejudice, the kinds of scholarship, arguments, and manipulations we can do using each version depend on its format.

A contemporary paperback copy of Pride and Prejudice is likely no help in understanding early 19th-century bookbinding practices in London, but an 1813 copy of the same text may give us some insight. Similarly, a physical print copy of the book tells us nothing about word frequency (unless we want to count each word by hand), but a computer can easily return vocabulary density information about a digital text copy. Digital copies do not replace physical texts, but instead open up the text to new kinds of computer-assisted analyses. Digital texts and digital data are the basis for what is broadly termed ‘digital scholarship’: the use of software, code, the Internet, GIS, and so on towards new understandings and visualizations of information.
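To make the word-counting idea concrete, here is a minimal Python sketch of how a computer might report word frequency and vocabulary density for a plaintext passage. It is an illustration, not a tool used in this guide. The helper names (tokenize, vocabulary_density) are hypothetical, the tokenizer is deliberately simple (lowercase letters and straight apostrophes only), and "vocabulary density" is assumed here to mean the ratio of unique words to total words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (runs of letters and straight apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_density(tokens):
    """Assumed definition: number of unique words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Opening sentence of Pride and Prejudice, used as a tiny stand-in for a full text.
sample = ("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife.")

tokens = tokenize(sample)
frequencies = Counter(tokens)  # maps each word to the number of times it appears

print(frequencies.most_common(3))            # the three most frequent words
print(round(vocabulary_density(tokens), 2))  # share of words that are unique
```

The same two functions could be pointed at the full plaintext copy of the novel by reading the file into a string first. A real project would likely want a more careful tokenizer that handles curly apostrophes, hyphens, and the like.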

Example: In July 2017, the New York Times covered projects that used data to understand the continued popularity of Jane Austen’s novels, and suggested that the key may lie in the author’s word choice. The authors used a method called “principal components analysis” to graphically represent the presence of naturalism in Austen’s texts. Another study covered by the article found that Austen used a higher rate of intensifiers (very, much, so) than her contemporaries and that, in context, this spoke to her characteristic use of irony.
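As a rough illustration of what “a higher rate of intensifiers” could mean in practice, the sketch below counts how often the three example words appear per 1,000 words of a passage. This is not the method used in the study described above. The function name intensifier_rate is hypothetical, both passages are invented for the example, and a plain word match will overcount, since “so” and “much” are not always used as intensifiers.

```python
import re

# The three example intensifiers named in the text; a real study would use a longer list.
INTENSIFIERS = {"very", "much", "so"}


def intensifier_rate(text, intensifiers=INTENSIFIERS):
    """Return the number of intensifier occurrences per 1,000 words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in intensifiers)
    return 1000 * hits / len(words)


# Two invented passages (not quotations from any novel) for comparison.
passage_a = "She was very much pleased, and so very glad to see him."
passage_b = "She was pleased, and glad to see him."

print(round(intensifier_rate(passage_a), 1))
print(round(intensifier_rate(passage_b), 1))
```

Run over whole novels rather than single sentences, the same calculation gives one number per text, which makes it possible to compare one author against others.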

Computers can be used to see trends and patterns that go unnoticed by the human eye. This is especially helpful for projects like the Jane Austen case study above, where the corpus of interest (the set of texts or other media used for analysis), in that case 127 works of early British fiction, would be too labor-intensive, unwieldy, or inappropriate to read one by one for the purposes of the research. Computers can “read” a lot of text very quickly, and tell us information about a corpus that would be impossible to pick up from a close reading of a few books.

Next in this series: Creating Digital Documents and Metadata + Text Encoding