The following is a guest post by Kalev H. Leetaru of Georgetown University (Former), Robert Miller of Internet Archive and David A. Shamma from Yahoo Labs/Flickr.
In 1994, linguist Geoff Nunberg stated, in an article in the journal “Representations,” “reading what people have had to say about the future of knowledge in an electronic world, you sometimes have the picture of somebody holding all the books in the library by their spines and shaking them until the sentences fall out loose in space…” What would these fragments look like if you took every page of every book from 2.5 million volumes dating back over 500 years? Could every illustration, drawing, chart, map, photograph and image be extracted, indexed and displayed? That was the question that launched the Internet Archives Book Images Project to catalog the imagery from half a millennium of books.
Over 14.7 million images were extracted from over 600 million pages covering an enormous variety of topics and stretching back to the year 1500. Yet, perhaps what is most remarkable about this montage is that these images come not from some newly-unearthed archive being seen for the first time, but rather from the books we have been digitizing for the past decade that have been resting in our digital libraries.
The history of book digitization has focused on creating text-based searchable collections–the identification of tens of millions of images on the pages of those books has historically been regarded as merely a byproduct of the digitization process. We inverted that model, reimagining books as containers of images rather than text. We then explored how digital libraries can yield rich visual collections by using modern image recognition technology coupled with Flickr to ultimately create one of the largest visual book collections in history. This involved automatically identifying visual content, cropping it out, extracting surrounding metadata via optical character recognition and uploading and indexing the structured data to Flickr. In effect, we are repurposing the vast archives of digital content created for text search and transforming it into a visual gallery of imagery. In doing so, we are creating a new way of “seeing” our cultural history.
How does one go about creating an archive of images spanning over 500 years? After seeing Albert Bierstadt’s “Among the Sierra Nevada Mountains, California” Kalev Leetaru (a co-author of this post) became curious as to how the American West of the nineteenth century was portrayed in literature–in particular how it was portrayed in imagery rather than words. Yet, searches in various digital libraries for Albert Bierstadt’s name yielded almost exclusively textual descriptions of his artwork, not photographs or illustrations – the same was true for searches of other nineteenth century technologies like the telegraph, the telephone, and the railroad.
Current digital libraries are designed to search for a word or phrase and return a list of pages mentioning it. The notion of searching for images appearing beside a word or phrase is simply not supported. While the concept of “image search” has become ingrained in our daily lives, it has largely remained a tool exclusively for searching the web, rather than other modalities like the printed word. As we’ve sought to find structure online we have seemingly overlooked the inherent knowledge within printed books. The call was clear: expand beyond a simple search box query and unlock the imagery of the world’s books. With this call in hand, Leetaru approached the Internet Archive, as one of the world’s largest open collections of digitized books, for permission to use their vast collection of over 600 million pages of digitized books stretching back half a millennia, and Flickr, as one of the largest online image services, to host the final collection of images in an interactive and searchable form.
A System Prototype
Historically, PDF versions of scanned books were created as what is called “image over text” files: the scanned image of each page is displayed with the OCR text hidden underneath. This works well for desktops and laptops with their unlimited storage and bandwidth, but can easily generate files in the tens or hundreds of megabytes. The rise of eReader devices, with their limited storage and processing capacities, necessitated more optimized file formats that save books as ASCII text and extract each visual element as a separate image file. Many digital libraries, including the Internet Archive, make their books available this way in the open EPUB file format, which is essentially a compressed ZIP file containing a set of HTML pages and the book’s images as JPG, PNG or GIF files. Extracting the images from each book is therefore as simple as unzipping its EPUB file, saving the images to disk, and searching the HTML pages to locate where each image appears in the book to extract the text surrounding it.
The simplicity here is what makes it so powerful. The hard part of creating an image gallery from books lies in the image recognition process needed to identify and extract each image from the page, yet this task is already performed in the creation of the EPUB files. This also means that this process can be easily repeated for any digital library that offers EPUB versions of their works, making it possible to one day create a single master repository of every image published in every book ever digitized. The entire processing pipeline was performed on a single four-processor virtual machine in the Internet Archive’s Virtual Reading Room. While all of the books used are available for public download on the Archive’s website, using the Virtual Reading Room made it possible to work much more easily with the Archive’s collections, dramatically reducing the time it took to complete the project.
Creating the Final Archive
The chief limitation of this prototype solution was that the images in EPUB files are downsampled to minimize storage space and processing time on the small portable devices they are designed for. Creating a gallery of rich high-resolution imagery from the world’s books requires returning to the original raw page scan imagery and using the OCR results to identify and crop each image from those scans.
From the beginning, it was decided to leverage the existing OCR results for each book instead of trying to develop new algorithms for identifying images from scanned book pages. It is hard to improve upon the accuracy of OCR software designed solely for the purpose of identifying text and images from scanned pages and developed by companies with large research and development staffs focusing exclusively on this very topic. In addition, few libraries have large computational clusters capable of running sophisticated image processing software, with OCR either outsourced to a vendor or run in scheduled fashion on dedicated OCR servers. By using the existing OCR results, the most computationally-intensive portion of the process is simply reused and all that remains is to crop the images from the page scans using the OCR results.
The final system uses the raw output of the Abbyy OCR software that the Internet Archive runs over each book, coupled with the scanning data produced when the book was originally digitized, to identify and extract the images at the original digitization resolution. As part of the OCR process to extract searchable text from each page, the OCR software identifies images on each page and sets them aside. Similar to the process used to create the EPUB files, this image information was used to extract each of these images to create the final gallery, along with the text surrounding each image.
The process begins by downloading from the Internet Archive the master list of all files available for a book, including its original page scan images and Abbyy OCR file. The original page scan imagery (JPEG2000, JPEG or TIFF format) and the OCR’s XML output must be present or the book is skipped. Next, the “scandata.xml” file (which contains a list of all pages in the book and their status) is examined to locate page scans that were flagged by the human operator of the book scanning system for exclusion. An example might be a bad page scan where the page was not in proper position for scanning and thus was rescanned. In this case instead of deleting the bad scan images, they are simply flagged in the scandata.xml file. Similarly, to ensure proper color calibration, a “color card” and ruler is photographed before and after each book is scanned (and sometimes periodically at random intervals throughout). These frames are also flagged in the XML file for exclusion. These page scans were dropped before the OCR software was run on the book by the Archive and thus must be identified to align the page numbering.
Next, the Abbyy OCR XML file is examined to locate each image in the book and its surrounding text. This file has an enormous wealth of information calculated by the OCR software. It breaks each page into a series of paragraphs, each paragraph into lines, and each line into individual characters. It also provides a confidence measure on how “sure” it is of each individual letter. Each region, line, character and image includes “l,t,r,b” parameters yielding its “left,” “top,” “right” and “bottom” coordinates in pixels within the original page scan image. Thus, to extract each image from a book, one simply scans the XML file for the “image” regions, reads its coordinate and crops this region from the original page scan imagery – no further analysis is necessary. The text surrounding each image is also extracted from the OCR file to enable keyword searching of the images by their context.
Now that the list of images and their surrounding text has been compiled, the system must download the full-resolution page scans to extract the images from. The problem is that the page scans for each book are delivered as ZIP files that are usually several hundred megabytes, occasionally exceeding one gigabyte. While at first this may not seem like much, when multiplied by a large number of books, the network bandwidth and the speed of computer hard-drives become critical limiting factors. A set of 22-core virtual machines were used to handle much of the computing needs of this project.
Typically one might attempt to process one book per core, meaning 22 books would be processed in parallel at any given time. If all 22 books had 500MB ZIP files containing their full resolution page scan imagery, this would require downloading 11GB of data over the network per second. Assuming that each book takes less than a second of CPU time on average to process, this would require a 100Gbps network link working at 100% capacity to sustain these processing needs. Even if this was broken into 22 separate virtual machines, each with a single core, all 22 machines would still each require their own four Gbps network link working at 100% capacity and delivering 500MB/s bandwidth.
Even if the network bandwidth limitations are overcome, the greatest challenge is actually the disk bandwidth of writing a 500MB ZIP file to disk from the network and then unpacking it (reading the 500MB) and updating the file system metadata to handle several hundred new files being written to disk (writing its 500MB of contents to disk). Thus, all said and done, a single 500MB ZIP file requires 500MB to be read twice and written twice, totaling two GB of total IO. Processing 22 files per second would require 44GB/s of disk bandwidth.
Many cloud computing vendors limit virtual machines to around 100MB/s sustained writes and 200MB/s sustained read performance, while even a dedicated physical three Gbps SATA hard drive operating at peak capability would still require two seconds just to write the ZIP file to disk and another two seconds to unpack it and write its contents to disk as individual files (assuming the intermediate reads are kernel buffered). Since writes are linear, SSD disks do not provide a speed advantage over traditional mechanical disks. While more exotic storage infrastructures exist that can support these needs, they are rarely found in library environments.
Our processing pipeline was therefore designed to minimize network and disk IO even at the cost of slowing down the processing of a single book. In the final system, if a book had less than 50 pages containing images to be extracted the page images were downloaded individually via the Archive’s ZIP API. Instead of downloading the entire ZIP file to local disk, the ZIP API allows a caller to request a single file from a ZIP archive and the Archive will extract, uncompress and return that single file from the ZIP archive. For optimization, contents of the ZIP were extracted individually via calls to the Archive’s web service at an average delay of around two seconds per image.
A book with 50 pages containing images would therefore require around 1.6 minutes to download all of the images, whereas on the virtual machines used for this project the raw page scan ZIP could be downloaded in under 30 seconds. However, downloading multiple full ZIP files would exceed the network and disk bandwidth on the virtual machines, raising the total time required to download the full ZIP file to about 10 minutes. For books with more than 50 pages of images, the full ZIP was downloaded, but only the needed pages were unpacked from the ZIP file instead of all pages, again reducing the read/write bandwidth. Combined, these two techniques reduced the IO load on each virtual machine to the point that the CPUs were able to be kept largely occupied when running 44 books in parallel.
Finally, the book images needed to be cropped from the full resolution page scans. The de facto tool for such image operations is ImageMagick, however its performance is extremely slow on large JPEG2000 images. Instead, we used the specialized “kdu_expand” tool from Kakadu Software to extract images and CJPEG to write the resulting image to disk. This resulted in a speedup of eight to 30 times by comparison to ImageMagick. To minimize memory requirements, the final system actually generates a shell script and then exits and invokes the shell script to perform all of the actual image processing components of the pipeline, freeing up the memory originally used to process the XML files since per-core memory was highly limited on the systems available for this project. Finally, the list of extracted JPEG images and a tab-delimited inventory text file listing the attributes of each image and the text surrounding them is compressed into one ZIP file per book, ready for use. Thus, while conceptually quite simple, the final system required considerable iterative development and enormous effort to minimize IO at all costs.
The Archive, the Flickr Commons, and the Future
With the mechanics of data management and image extraction completed, the first 2.6 million images from the collection were uploaded to the Internet Archive Book Images collection in the Flickr Commons in July/August 2014, making it searchable and browseable to the entire world. Each image is uploaded with a set of indexed tags such as book year (like 1879) or book subject (like sailing) and the book’s id (like bookidkittenscatsbooko00grov). Each image’s description has detailed information about the image caption, text before and after the image, book’s title, and the page number. In the Flickr Commons, the images themselves carry no known copyright restrictions; this itself carries new implications for libraries and archives online. In a recent conversation about this collection, Cathy Marshall stated:
“…here are over 2.5 million images, ripe for reuse and reorganized into different collections and subcollections. Do we necessarily want to read the whole of a monograph about polycystins from 1869? Probably not. But might we use a stippled protozoan illustration as a homescreen background (in this case, with little concern for copyright restrictions). And we would be in good company: in a recent study, Frank Shipman and I found that over 80% of our participants (202 out of 242) downloaded photos they found on the Internet with little concern for copyright restrictions. Almost 3/4 reused these photos in new contexts. It’s easy to see how book images, extracted and offered without copyright concerns, would be an attractive online resource for purposes many of us haven’t even foreseen yet.”
We also look towards annotations. From people in the communities and in the libraries, human annotation can help classify – formally and informally – the content to be discovered in the books. Even signals that one might take for granted, such as marking an image as a favorite or leaving a comment, can be quite valuable in social computing to further understand a corpus and help tell the stories contained across all the books. The structured data and human annotations here are the first steps; computer vision systems can further index concepts.
We are just now at the beginning of what is possible with this collection. From here, we hope to begin a transformation of the Internet Archive’s collections into a dynamic, growing collection where concepts, objects, locations and even music can be discovered, not as just an index of text around images, but rather a deeper knowledge model that utilizes the structure of books, publishing and libraries to understand the world and allow scholars to begin asking an entirely new generation of research questions.