The Digital Content Management section has been working on a project to extract and make available sets of files from the Library’s significant Web Archives holdings. This is another step to explore the Web Archives and make them more widely accessible and usable. Our aim in creating these sets is to identify reusable, “real world” content in the Library’s digital collections, which we can provide for public access. The outcome of the project will be a series of datasets, each containing 1,000 files of related media types selected from .gov domains. We will announce and explore these datasets here on The Signal, and the data will be made available through LC Labs. Although we invite usage and interest from a wide range of digital enthusiasts, we are particularly hoping to interest practitioners and scholars working on digital preservation education and digital scholarship projects.
Understanding the 1,000 Image Files Dataset
What do a pixel art rendering of Herndon Elementary School, a patent diagram for a golf putter, a visualization of weather forecast data, and a classic piece of clip art have in common?
Stumped? You can find each of them in our newest dataset of 1,000 files with image media types, as purported in the extracted metadata, randomly selected from archived .gov domain sites. (Our three previous posts cover PDF, audio, and tabular data media types.) Images play a key role both as structural elements of web design and as visual resources displayed as part of web content. This dataset provides a unique way to explore the roles that digital image files play in the development of government web design.
According to the metadata for these web objects, the four most common image formats in this random set are GIF, JPEG, PNG, and TIFF. A few other formats are included in the sample (e.g., BMP, PJPG), but they are of negligible quantity. The total size of the image dataset is approximately 163 MB. The largest single file is 25 MB, the smallest file is 51 bytes, and the median size is 153 KB. Items in this set were harvested during web crawls conducted over two decades, from 1996 to 2017. Most objects in this dataset were harvested between 2008 and 2017.
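For readers who want to reproduce figures like these from the dataset's metadata, here is a minimal sketch. It assumes each file is described by a record with hypothetical `format` and `size_bytes` fields (the actual column names in the published metadata may differ), and the sample rows are made up for illustration.

```python
from collections import Counter
from statistics import median


def summarize(records):
    """Return format counts and basic size statistics for metadata records."""
    sizes = [r["size_bytes"] for r in records]
    return {
        "formats": Counter(r["format"] for r in records),
        "total_bytes": sum(sizes),
        "largest": max(sizes),
        "smallest": min(sizes),
        "median": median(sizes),
    }


# Illustrative rows only; field names and values are assumptions.
sample = [
    {"format": "GIF", "size_bytes": 51},
    {"format": "JPEG", "size_bytes": 153_000},
    {"format": "TIFF", "size_bytes": 25_000_000},
]
summary = summarize(sample)
print(summary["formats"].most_common())
print(summary["median"])
```

The median is used alongside the largest and smallest sizes because a handful of very large files can pull the average far away from a typical file.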
The following chart shows the number of image files in the dataset from each year, plotted separately for each file format reported in our metadata. The chart is brimming with points of interest and larger trends that warrant further analysis!
Standout points of interest include the large spikes in 2008–2009 and 2014–2016. The 2009 spike shows a dramatic increase in the number of JPEGs in the sample, but the ratio of JPEGs to other formats looks relatively consistent with the preceding years. It is fair to suppose that one or more events resulted in heavier crawling of familiar government sites, particularly because the ratio holds steady even as activity wanes in the following years. Perhaps the spikes are related to government elections or some other high-profile, regularly scheduled event.
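One way to check whether the ratio of JPEGs to other formats really holds steady is to compute each format's share per year. The sketch below is only an illustration: it assumes hypothetical `year` and `format` fields on each metadata record, and the sample rows are invented.

```python
from collections import Counter, defaultdict


def format_share_by_year(records, fmt="JPEG"):
    """Map each crawl year to the fraction of its files reported as `fmt`."""
    per_year = defaultdict(Counter)
    for r in records:
        per_year[r["year"]][r["format"]] += 1
    return {
        year: counts[fmt] / sum(counts.values())
        for year, counts in sorted(per_year.items())
    }


# Illustrative rows only; field names and values are assumptions.
sample = [
    {"year": 2008, "format": "JPEG"},
    {"year": 2008, "format": "GIF"},
    {"year": 2009, "format": "JPEG"},
    {"year": 2009, "format": "JPEG"},
    {"year": 2009, "format": "GIF"},
    {"year": 2009, "format": "GIF"},
]
shares = format_share_by_year(sample)
print(shares)
```

In this made-up sample the number of files doubles from 2008 to 2009 while the JPEG share stays at one half, which is the pattern of a spike in volume with a steady ratio.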
Let us switch our focus to formats to see if a different perspective can help make sense of the spikes and other points of interest. The consistently low ratio of TIFFs is understandable: this format is not generally viewed as web-friendly, since TIFFs are often stored uncompressed, typically resulting in very large files that do not work well for web content. JPEG, PNG, and GIF, however, are optimized for efficient transmission across the web, which means that the ebbs and flows (and jarring spikes) on the chart can likely be explained by contextual clues, such as the content of the images or which government organization is doing the publishing. But before we dive into actual content, let’s take one more look at the metadata for image dimensions.
The above scatterplot shows the distribution of images by dimensions, with width and height plotted on the x and y axes and measured in pixels (px), the smallest elements of a digital image that can be displayed on a screen. Our smallest image is 10 px by 10 px, but that data point may prove hard to spot amidst the many others cramped in the long tail! Assuming that we have a monitor with 75 PPI (Pixels Per Inch), this image should measure in at about .13’’ x .13’’. The largest image in our set is 7651 px by 7651 px, which comes in at roughly 102’’ x 102’’. (Remember to view the image at full size if you want to appreciate the scale!)
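The conversion from pixels to on-screen inches is a single division by the display's pixel density. Here is a small sketch using the 75 PPI assumption from this post; the function name `px_to_inches` is ours and purely illustrative.

```python
def px_to_inches(pixels, ppi=75):
    """Convert a pixel length to inches for a display with the given PPI."""
    return round(pixels / ppi, 2)


smallest = px_to_inches(10)    # 10 / 75
largest = px_to_inches(7651)   # 7651 / 75
print(smallest, largest)       # 0.13 102.01
```

A different monitor changes the result: the same 10 px image on a 150 PPI display would measure half as wide, so these inch figures are best read as a rough sense of scale.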
Over half the total images (526) are under 400 px by 500 px (roughly 5.33’’ x 6.66’’). A closer look at that grouping reveals something even more striking: 425 of those images are under half that size! This seems a perfect time to check out some of the archived images to get a better sense of their varied contexts within government publishing on the web.
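Counts like these can be produced by filtering the metadata on both dimensions at once. The sketch below assumes hypothetical `width` and `height` fields in pixels, and it reads "under 400 px by 500 px" as width below 400 and height below 500; other readings (for example, comparing total pixel area) would give different counts.

```python
def count_under(records, max_width, max_height):
    """Count images whose width and height are both below the given limits."""
    return sum(
        1
        for r in records
        if r["width"] < max_width and r["height"] < max_height
    )


# Illustrative rows only; field names and values are assumptions.
sample = [
    {"width": 10, "height": 10},
    {"width": 300, "height": 250},
    {"width": 150, "height": 200},
    {"width": 5024, "height": 5024},
]
under_full = count_under(sample, 400, 500)
under_half = count_under(sample, 200, 250)
print(under_full, under_half)  # 3 2
```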
Infrastructural imagery, or Who captures the CAPTCHAs?
Let’s first dive into the smaller end of the image size spectrum. Here we find a tiny cyclist icon (GIF; 18 px by 21 px; .24’’ x .28’’), a black box (GIF; 25 px by 5 px; .33’’ x .07’’), a multi-colored stack of tiny blocks (PNG; 12 px by 28 px; .16’’ x .37’’), a left-facing arrow (GIF; 23 px by 21 px; .31’’ x .28’’), a clickable image that says “Get a Law Book” (GIF; 94 px by 17 px; 1.25’’ x .23’’), and so on. The dataset has a significant number of what we will call infrastructural images; that is, images which function primarily to help users organize, comprehend, and navigate online content.
Moving on, there are thirty-two harvested PNGs from the Oregon U.S. District Court with the exact same dimensions, 90 px by 32 px (1.2’’ x .43’’). These images present another form of infrastructural imagery: CAPTCHA images (Completely Automated Public Turing test to tell Computers and Humans Apart) that are automatically generated for the purpose of user identification/verification. The ephemeral nature of these tiny tests makes them easy to overlook, but the sheer quantity of CAPTCHAs affirms their significance circa 2016.
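Clusters of identically sized images from a single site, like these CAPTCHAs, can be surfaced by grouping the metadata on host, format, and dimensions. This is only a sketch: the `host`, `format`, `width`, and `height` fields are hypothetical names, and the sample hosts are placeholders rather than real crawl targets.

```python
from collections import Counter


def repeated_dimensions(records, min_count=2):
    """List (host, format, width, height) groups seen at least min_count times."""
    counts = Counter(
        (r["host"], r["format"], r["width"], r["height"]) for r in records
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Illustrative rows only; hosts, field names, and values are assumptions.
sample = [
    {"host": "court.example.gov", "format": "PNG", "width": 90, "height": 32},
    {"host": "court.example.gov", "format": "PNG", "width": 90, "height": 32},
    {"host": "court.example.gov", "format": "PNG", "width": 90, "height": 32},
    {"host": "maps.example.gov", "format": "GIF", "width": 18, "height": 21},
]
clusters = repeated_dimensions(sample)
print(clusters)
```

A large group does not prove the images are CAPTCHAs, of course; it simply flags a set worth opening and inspecting by eye.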
Looking at the Big(ger) Picture(s)
On the other end of the spectrum we find larger images that function more as content a user explicitly seeks out to view. This first example is one of our largest images in both dimensions (5,024 px by 5,024 px; TIFF; 66.99’’ x 66.99’’) and file size (25 MB): a TIFF from the 2010 Western Lake Michigan NOAA NGS DSS Infrared 8 Bit Imagery dataset.
Looking further, there are images from several other government organizations, including a visualization of a directory layout from NASA (12.9 KB; GIF; 14.8’’ x 8.63’’); a diagram from Fermilab, a particle physics and accelerator laboratory (1 MB; JPG; 48’’ x 29’’); and a chart from the Department of Commerce regarding U.S. imports of structural pipe and tube (26.9 KB; GIF; 8’’ x 12.8’’). Other points of interest include images with unusual dimensions, like this 6,000 px by 135 px image (418 KB; PNG; 80’’ x 1.8’’), a photo of Earth generated from data captured by the Wide Field Camera, part of the Hubble Space Telescope.
Finally, there seem to be images that are dynamically generated based on available data. This soundings chart for Tapachula, Mexico (25.4 KB; GIF; 6.66’’ x 9.33’’), is likely powered by data from NOAA’s Storm Prediction Center. Meanwhile, the July 29, 2000, sounding for 18N 83W is blank, possibly because the archived file is isolated from the server-based backend that provides data for rendering the sounding images.
We hope this post serves as inspiration to data wranglers to explore this and our other .gov datasets. As more researchers dig into the Library’s massive Web Archives, we can reflect on ways in which the Library can better facilitate their work. Researchers coming in to study images in the web archives would do well to go in thinking about the dual role that images play as structural elements in web design and as visual resources integrated into websites. Both kinds of images are of potential research interest. Perhaps continued analysis of the Web Archives will uncover more general types of images (e.g., infrastructural imagery, photographs, charts), along with their technical characteristics, and this information will be used to form more nuanced queries.
So that’s all for this entry of In the Library’s Web Archives, though we have only scratched the surface of the sample images. Did a particular aspect of the discussion pique your curiosity? Are there other tools or bits of technical metadata that have been overlooked? Please share your thoughts in the comments below!