Creating Digital Documents
The first step in creating an electronic copy of an analog (non-digital) document is usually scanning it to create a digitized image (for example, a .pdf or a .jpg). Scanning a document is like taking an electronic photograph of it: the result is a file that can be saved to a computer, uploaded to the Internet, or shared in an e-mail. In some cases, such as when you are digitizing a film photograph, a high-quality digital image is all you need. But in the case of textual documents, a digital image is often insufficient, or at least inconvenient. At this stage, we only have an image of the text; the text isn’t yet in a format that can be searched or manipulated by the computer (think of trying to copy and paste text from a picture you took with your camera: it isn’t possible).
Optical Character Recognition (OCR) is an automated process that extracts text from a digital image of a document to make it readable by a computer. The computer scans through an image of text, attempts to identify the characters (letters, numbers, symbols), and stores them as a separate “layer” of text on the image.
Example Here is a digitized copy of Alice in Wonderland in the Internet Archive. Notice that though this ebook is made up of scanned images of a physical copy, you can search the full text contents in the search bar. The OCRed text is “under” this image, and can be accessed if you select “FULL TEXT” from the Download Options menu. Notice that you can also download a .pdf, .epub, or many other formats of the digitized book.
The success of OCR depends on the quality of the software and the quality of the photograph, and even sophisticated OCR has trouble with images that contain stray ink blots or faded type. Still, these programs are what allow users of digital archives to search not only catalog metadata but also the full contents of scanned newspapers (as in Chronicling America) and books (as in most digitized books available from libraries and archives).
Because of these limits, automated OCR text often needs to be “cleaned” by a human reader. Especially with older typeset texts that have faded or mildewed or are otherwise irregular, the software may mistake characters or character combinations for others (e.g. the computer might take “rn” to be “m” or “cat” to be “cot” and so on). OCR is often left “dirty,” but unchecked OCR prevents comprehensive searches: if one were searching a set of OCRed texts for every instance of the word “happy,” the computer would not return any of the instances where “happy” had been read as “hoppy” or “hoopy” (and conversely, would wrongly return places where the computer had read “hoppy” as “happy”). Humans can clean OCR by hand to “train” the computer to interpret characters more accurately (see: machine learning).
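To make the search problem concrete, here is a minimal Python sketch. The sample sentence, the function names, and the similarity cutoff of 0.75 are all assumptions chosen for illustration; the sketch uses the standard library's difflib to compare words, which is only one of several ways to tolerate OCR errors.

```python
import difflib
import re

# Hypothetical OCR output in which the second "happy" was misread as "hoppy".
ocr_text = "She was happy then. He seemed hoppy too."


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def exact_matches(text, word):
    """Return only the tokens that equal the search word exactly."""
    return [t for t in tokenize(text) if t == word]


def fuzzy_matches(text, word, cutoff=0.75):
    """Return tokens whose similarity to the search word meets the cutoff."""
    return [
        t for t in tokenize(text)
        if difflib.SequenceMatcher(None, t, word).ratio() >= cutoff
    ]


print(exact_matches(ocr_text, "happy"))
print(fuzzy_matches(ocr_text, "happy"))
```

The exact search finds only the correctly read word, while the fuzzy search also returns the misread one. The trade-off is that a looser cutoff will also pull in words that really are different, so fuzzy matching complements hand-cleaning rather than replacing it.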
In this image of OCR output, we can see some of the errors: the “E”s in the title were interpreted as “Q”s, and in the third line, a “t” was interpreted by the computer as an “f”.
Even with imperfect OCR, digital text is helpful for both close reading and distant reading. In addition to more complex computational tasks, digital text allows users to, for instance, find the page number of a quote they remember, or find out if a text ever mentions Christopher Columbus. Text search, enabled by digital text, has changed the way that researchers use databases and read documents.
Metadata + Text Encoding
Bibliographic search (locating items in a collection) is one of the foundational tasks of libraries. Computer-searchable library catalogs have revolutionized this task for patrons and staff, enabling users to find more relevant materials more quickly.
Metadata is “data about data”. Bibliographic metadata is what makes up catalog records, from the time of card catalogs to our present-day electronic databases. Every item in a library’s holdings has a bibliographic record made up of this metadata: key descriptors that help users find the item when they need it. For example, metadata about a book might include its title, author, publishing date, ISBN, shelf location, and so on. In an electronic catalog search, this metadata is what allows users to progressively narrow their results to materials targeted to their needs. Rich, accurate metadata, produced by human catalogers, allows users to find in a library’s holdings, for example, 1. any text material, 2. written in Spanish, 3. about Jorge Luis Borges, 4. published between 1990 and 2000.
Metadata needs to be in a particular format to be read by the computer. A markup language is a system for annotating text to give the computer instructions about what each piece of information is. XML (eXtensible Markup Language) is one of the most common ways of structuring catalog metadata, because it is legible to both humans and machines.
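The four-step narrowing search described above can be sketched in a few lines of Python. Everything here is hypothetical: the Record fields, the titles, and the narrow function are invented for illustration and do not reflect how any real catalog system stores or queries its records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A hypothetical, heavily simplified bibliographic record."""
    title: str
    material: str      # e.g. "text" or "audio"
    language: str      # three-letter language code, e.g. "spa"
    subjects: tuple    # subject headings assigned by a cataloger
    year: int          # year of publication


# Invented records; none of these describe real items.
catalog = [
    Record("Hypothetical study A", "text", "spa", ("Borges, Jorge Luis",), 1995),
    Record("Hypothetical study B", "text", "eng", ("Borges, Jorge Luis",), 1996),
    Record("Hypothetical recording C", "audio", "spa", ("Borges, Jorge Luis",), 1998),
    Record("Hypothetical study D", "text", "spa", ("Borges, Jorge Luis",), 1985),
    Record("Hypothetical study E", "text", "spa", ("Cortázar, Julio",), 1999),
]


def narrow(records, material, language, subject, first_year, last_year):
    """Keep only the records that satisfy every metadata condition."""
    return [
        r for r in records
        if r.material == material
        and r.language == language
        and subject in r.subjects
        and first_year <= r.year <= last_year
    ]


hits = narrow(catalog, "text", "spa", "Borges, Jorge Luis", 1990, 2000)
print([r.title for r in hits])
```

Each condition removes records that fail one facet, which is why the search is only as good as the metadata: a record with a missing language code or subject heading would silently drop out of the results.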
XML uses tags to label data items, and tags can be nested inside each other. In a simple recipe record, for example, <recipe> is the first (outermost) tag. All of the tags between <recipe> and its end tag </recipe> (<title>, <ingredient list>, and <preparation>) are components of <recipe>. Further, <ingredient> is a component of <ingredient list>.
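The recipe structure just described might look like the following sketch, which embeds a small XML document in Python and reads it with the standard library's xml.etree.ElementTree. The recipe contents are invented, and the ingredient list tag is written here as ingredient-list because XML tag names cannot contain spaces; that spelling is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# An invented recipe; the tag names follow the description in the text,
# with "ingredient-list" hyphenated so that it is a valid XML name.
recipe_xml = """
<recipe>
  <title>Basic bread</title>
  <ingredient-list>
    <ingredient>flour</ingredient>
    <ingredient>water</ingredient>
    <ingredient>yeast</ingredient>
  </ingredient-list>
  <preparation>Mix, knead, let rise, and bake.</preparation>
</recipe>
"""

root = ET.fromstring(recipe_xml)

# The direct children of <recipe> are its components.
components = [child.tag for child in root]

# <ingredient> elements are nested one level further down.
ingredients = [
    item.text for item in root.find("ingredient-list").findall("ingredient")
]

print(root.tag, components)
print(root.find("title").text, ingredients)
```

Because every piece of data sits inside a labeled tag, the program can ask for the title or the ingredients by name instead of guessing where they appear in the text.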
The MARC (MAchine Readable Cataloging) standards, developed in the 1960s by Henriette Avram at the Library of Congress, are the international standard data format for the description of items held by libraries. Here are the MARC tags for one of the hits from our Jorge Luis Borges search above:
The three-digit numbers in the left column are “datafields” and the letters are “subfields”. Each field-subfield combination refers to a piece of metadata. For example, 245$a is the title, 245$b is the subtitle, 260$a is the place of publication, and so on. The rest of the fields can be found here.
MARCXML is one way of reading and parsing MARC information, popular because it’s an XML schema (and therefore readable by both humans and computers). For example, here is the MARCXML file for the same book from above: https://lccn.loc.gov/99228548/marcxml
The datafields and subfields are now XML tags, acting as ‘signposts’ for the computer about what each piece of information means. MARCXML files can be read by humans (provided they know what each datafield means) as well as computers.
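As a rough illustration of how a program follows those signposts, the sketch below looks up field-subfield combinations such as 245$a in a tiny MARCXML-style record. The record contents are invented placeholders (they are not the record linked above), and get_subfield is a hypothetical helper; the namespace is the one used by MARCXML, and the parsing relies on Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML documents.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# An invented, minimal record; real records carry many more fields.
marcxml = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Hypothetical title :</subfield>
    <subfield code="b">a made-up subtitle.</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">Example City :</subfield>
  </datafield>
</record>"""

record = ET.fromstring(marcxml)


def get_subfield(rec, tag, code):
    """Return the text of the first matching field-subfield pair, or None."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    element = rec.find(path, NS)
    return element.text if element is not None else None


print("245$a (title):", get_subfield(record, "245", "a"))
print("245$b (subtitle):", get_subfield(record, "245", "b"))
print("260$a (place of publication):", get_subfield(record, "260", "a"))
```

The helper returns None for a combination that is absent, since not every record uses every field.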
The Library of Congress has made its 2014 Retrospective MARC files available for public use: http://www.loc.gov/cds/products/marcDist.php
Examples The Library of Congress’s MARC data could be used for cool visualizations like Ben Schmidt’s visual history of MARC cataloging at the Library of Congress. Matt Miller used the Library’s MARC data to make a dizzying list of every cataloged book in the Library of Congress.
TEI (Text Encoding Initiative) is another important example of XML-style markup. In addition to capturing metadata, TEI guidelines standardize the markup of a text’s contents. Text encoding tells the computer who’s speaking, where a stanza begins and ends, and which parts of the text are stage directions in a play, for example.
Example Here is a TEI file of Shakespeare’s Macbeth from the Folger Shakespeare Library. Different tags and attributes (the further specifiers within the tags) describe the speaker, what word they are saying, in what scene, what part of speech the word is, etc. An encoded text like this can easily be manipulated to tell you which character says the most words in the play, which adjective is used most often across all of Shakespeare’s works, and so on. If you were interested in the use of the word ‘lady’ in Macbeth, an un-encoded plaintext version would not allow you to distinguish between references to “Lady” Macbeth and moments when a character says the word “lady”. TEI versions allow you to do powerful explorations of texts, though good TEI copies take a lot of time to create.
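A small sketch can show why the encoding matters for a question like the ‘lady’ one. The fragment below is a drastically simplified, TEI-like structure with invented placeholder lines (they are not quotations from the play, and the real Folger files are far richer and may use different attributes). It assumes each speech is an sp element with a who attribute and each spoken word is a w element.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# TEI namespace, written in ElementTree's {uri}tag form.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented speeches; speaker IDs live in attributes, spoken words in <w>.
tei_fragment = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <sp who="#LadyMacbeth"><l><w>come</w> <w>here</w></l></sp>
  <sp who="#Macbeth"><l><w>good</w> <w>lady</w> <w>wait</w> <w>now</w></l></sp>
  <sp who="#LadyMacbeth"><l><w>no</w></l></sp>
</body>"""

root = ET.fromstring(tei_fragment)

words_by_speaker = Counter()
spoken_lady = Counter()

for speech in root.iter(f"{TEI}sp"):
    speaker = speech.get("who")
    words = [w.text.lower() for w in speech.iter(f"{TEI}w")]
    words_by_speaker[speaker] += len(words)
    # Only words actually spoken are counted, never the speaker labels.
    spoken_lady[speaker] += words.count("lady")

print(words_by_speaker.most_common())
print({who: n for who, n in spoken_lady.items() if n})
```

Because the speaker's name is stored in an attribute rather than in the running text, the count of spoken instances of “lady” is not inflated by the speaker labels, which is exactly the distinction a plaintext search cannot make.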
Understanding the various formats in which data is entered and stored allows us to imagine what kinds of digital scholarship are possible with library data.
Example The Women Writers Project uses TEI to encode texts by early modern women writers and includes some text analysis tools.
Next week’s installment in the Digital Scholarship Resource Guide will show you what you can do with digital data now that you’ve created it. Stay tuned!