How are researchers and scholars going to make use of born-digital primary sources? It’s an open question which many working in digital preservation are interested in. As part of the NDSA innovation working group’s ongoing Insights interview series I am excited to talk with Zach Whalen, an English professor at the University of Mary Washington, about a project he developed that works off a digital collection (of sorts) to explore this topic.
He describes a Twitter bot he created called ROM TXT as “searching video game ROMs, tweeting strings of bytes from the ASCII range. Sometimes words. Every 5 minutes, for beauty.” In this interview we will unpack exactly what that is and means and what it has to do with born-digital collections.
Trevor: Could you give us a brief description of ROM TXT? What does it do? What source material does it draw from? What is its objective? It would be ideal if you could describe a few instances of what you think are the best examples of tweets it has made and why you find those particularly compelling.
Zach: Video games since about the mid-1970s run on code that’s stored as ROM (Read Only Memory) on unique chips inside their respective devices. Typically, these chips contain a combination of compiled code and data objects like images and sound files. All this can be extracted from those chips and stored for archival purposes, but for arcade games, one has to know quite a lot about the game in question and its CPU in order to analyze or reverse engineer it. This is what contributors to emulators like MAME do.
But even without knowing much about a specific game or differences among a large group of game ROMs, some educated guesswork can reveal parts of what’s going on behind our screens, including what might be text or what might be an image. ROM_TXT looks through ROM files, and when it thinks it’s found some data that could be words, it publishes that string as a tweet. Usually, it’s wrong, but those errors are some of my favorite tweets.
ROM_TXT really comes from two directions: my interest in Twitter bots and my research in game history. For the latter, I’d been working on a way to automatically index and eventually search through text used in older arcade games. I wasn’t having much luck automatically separating text from data, and I ended up with a big file of mostly noise. It was interesting noise, though. At the same time, I’ve been interested in Twitter bots, so I tend to have my eyes open for interesting datasets or linguistic templates. I realized I had one on hand with my failed attempt at a search engine, and I considered trying to find a way to clean out the text fragments and combine them in random ways like an _ebooks-style bot might. But I eventually decided just to embrace the messiness, so I wrote a simple program that just randomly chooses lines from my big, messy data file.
Those tweets tend to fall into a few categories of nonsense. My favorites are those that make interesting shapes or visual patterns from the noise. I also enjoy when it treats something as text that clearly wasn’t encoded as text, like “DADA”, which is ironically appropriate. “Poop” and “lulu” are usually popular.
Probably about half of the tweets are actual text, and these tend to be combinations of text meant to be displayed on screen either for game play or debugging. Occasionally it’s an easter egg or other discarded code. These latter categories can be exciting because otherwise they would only be visible in hex editors. Many appear to be inside jokes among a game’s programmers. Adam Parrish (@aparrish) somehow noticed one that was a ROT13-encoded list of programmers’ names.
Trevor: Could you tell us a bit about the set of ROMs you are working from? How did you find them? What made them an interesting source for you to work from? If libraries and archives wanted to encourage this kind of playful use of born-digital collections what would you want to see in how they provide access to them?
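For readers unfamiliar with ROT13: it is a simple substitution that shifts each letter 13 places through the alphabet, so applying it twice restores the original. The following is only an illustrative sketch using Python's standard library; the credit line is a made-up example, not text from any actual ROM.

```python
import codecs


def rot13(text: str) -> str:
    """Shift each ASCII letter 13 places; other characters pass through."""
    return codecs.encode(text, "rot_13")


# Hypothetical credit line, invented for illustration only.
hidden = rot13("PROGRAMMED BY ALICE AND BOB")
print(hidden)         # looks like noise in a hex editor
print(rot13(hidden))  # applying ROT13 again recovers the original text
```

Because the encoded letters still fall in the ASCII range, a string like this would surface in a scan for text-like bytes even though it would fail a dictionary lookup.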
Zach: I started with a large torrent download of ROMs packaged together for use with the MAME software. I think it has files for about 6,000 different arcade games. MAME is software that lets you play those arcade games if you have the appropriate ROM files on hand, and people download ROMs to replay old games. But one of the great things about MAME is that for it to work properly, it has to know a lot of things about the ROMs and the games they’re used in. With that metadata, you can take any game file and probably know quite a bit about who made that game and when.
For example, within ROM_TXT, each tweet is signed with a hashtag (like “#bublbobl”) that corresponds to the same identifier MAME uses for that game (“Bubble Bobble”). I can click that hashtag to search every tweet that ROM_TXT has sent from “#bublbobl”, and I can interact with MAME from a command line (“~$ mame -listxml bublbobl”) to get all the data that MAME knows about that game, like that this particular version was published by Taito in 1986, that it used a Zilog Z80 processor, that it has 4 difficulty levels, etc.
Some might point out that the torrent is illegal distribution of copyrighted material, but I think (and I think the MAME developers share this understanding) that it’s an incredibly useful archival tool for preserving a kind of born-digital artifact that might otherwise be thought of as disposable ephemera. And for what it’s worth, I also think that what I’m doing with ROM_TXT is consistent with the clause of the DMCA that allows making copies for archival and analytical purposes.
Trevor: You say ROM TXT tweets out “strings of bytes from the ASCII range” in the ROMs. Could you unpack that for our readers? What process are you using to read the ROMs? I’m particularly interested in this point as much of the work in digital preservation assumes that the use for digital objects is to recreate their original user experience, a point of view which is often dismissed as “screen essentialist.” To that end, do you think ROM TXT has anything to tell us about how users of born-digital collections might approach and read them?
Zach: The process I use to find strings of text isn’t very clever or very discriminating. Given a ROM file, I have no idea what parts of it are data, what kinds of data are stored there, or how to interpret that image. I do know, however, that the binary data in that file is almost certainly intended for an 8-bit processor like a Z80 or MOS 6502 and their various relatives. In other words, I know that every 8 bits is probably a byte, and every one of those bytes can be thought of as a number between 0 and 255.
ASCII is a standardized system for encoding text whereby the decimal number 65 (hexadecimal 41, binary 1000001) should be decoded as the letter A. The upper- and lower-case alphabet, Arabic numerals 0 – 9, and punctuation marks all lie between numbers 32 and 126 on the ASCII table.
My code looks for sequences of at least three bytes falling within that range, and accumulates chunks of word candidates. Since the ways programmers stored this text are idiosyncratic, words are often stored without spaces between them and punctuation is used in weird ways. This means I can’t just do a dictionary search for each sequence (or “string”) and consider that a found word. So I make the strings of “text” it finds as long as possible, then search that string for any substrings of 4 or more letters that return a positive response from a dictionary lookup.
This process results in quite a lot of false positives, but because it’s language and platform agnostic, it can find things that might otherwise slip through some cracks. You mention screen essentialism, but I think there’s also such a thing as code or platform essentialism, where the meaning of a digital work ends with an interpretation of what its code reveals it was meant to be doing. The artifactual fragments ROM_TXT produces sometimes reveal a glimpse into an inner life of a device in a way that makes no such assumptions about what that inner life means.
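To make the byte-to-character mapping concrete, here is a small illustrative sketch in Python. It is not part of ROM_TXT; the helper name `is_printable_ascii` is invented for this example.

```python
# The range of printable ASCII characters cited above.
PRINTABLE_MIN, PRINTABLE_MAX = 32, 126


def is_printable_ascii(byte_value: int) -> bool:
    """Return True if a byte value (0-255) is a printable ASCII character."""
    return PRINTABLE_MIN <= byte_value <= PRINTABLE_MAX


value = 65
print(value, hex(value), bin(value), chr(value))  # 65 0x41 0b1000001 A

# Four bytes that may never have been meant as text still decode as a word.
print(bytes([68, 65, 68, 65]).decode("ascii"))  # DADA
```

The same four byte values could just as easily be machine instructions or graphics data; nothing in the bytes themselves says which reading is intended.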
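The approach described above might be sketched roughly as follows. This is a reconstruction under stated assumptions, not Zach's actual code: the word list, the function names, and the sample bytes are all hypothetical, and keeping a candidate whenever at least one substring is a dictionary word is a guess at how the lookup result is used.

```python
import re

# Hypothetical stand-in for a real dictionary lookup.
SAMPLE_WORDS = {"game", "over", "player", "start", "insert", "coin"}

# Longest possible runs of three or more bytes in the ASCII range 32-126.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{3,}")


def candidate_strings(rom_bytes: bytes) -> list:
    """Return every maximal run of printable ASCII bytes as a string."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(rom_bytes)]


def dictionary_hits(candidate: str, words=SAMPLE_WORDS, min_len: int = 4) -> set:
    """Collect all-letter substrings of min_len or more found in the word list."""
    text = candidate.lower()
    hits = set()
    for start in range(len(text)):
        for end in range(start + min_len, len(text) + 1):
            piece = text[start:end]
            if piece.isalpha() and piece in words:
                hits.add(piece)
    return hits


def find_text(rom_bytes: bytes, words=SAMPLE_WORDS) -> list:
    """Keep candidates containing at least one dictionary word (an assumption)."""
    return [s for s in candidate_strings(rom_bytes) if dictionary_hits(s, words)]


# Made-up bytes standing in for a ROM file.
fake_rom = b"\x00\x01GAMEOVER\xff\x10\x3e\x21\x40\x00INSERT COIN\x00ab\x00"
print(candidate_strings(fake_rom))  # ['GAMEOVER', '>!@', 'INSERT COIN']
print(find_text(fake_rom))          # ['GAMEOVER', 'INSERT COIN']
```

In this toy input, the three bytes that decode as ">!@" pass the range test even though they are not text, while "GAMEOVER" only registers as words because the lookup examines substrings rather than the whole run.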
Trevor: Stepping back from this a bit. You’ve done a lot of work on video games and electronic literature. I would be curious to get your take on what kinds of things you would like to see more libraries and archives doing around collection development and modes of access in these areas.
Zach: I’m a big fan of the Media Archaeology Lab and the work MITH is doing with the collections from Bill Bly and Deena Larsen. The Learning Games Initiative at Arizona State has a nice archive they make available to game researchers, and Stony Brook has its William Higinbotham special collection. JSMESS is probably one of the coolest and most important projects in the area of digital preservation lately, and it’s important to note that it, like MAME, came about outside of traditional institutional support.
Each of these carries out an important preservation mission, but circulation and access are really important, too, especially for devices like arcade games or console games with quirky hardware (Vectrex or Virtual Boy, for example). As someone interested in media history, those quirks are where the action is. Experiencing something break or not work as expected can be a way to unpack the normative assumptions of ease that consumer electronics encourage us to take for granted.
I’m hoping to set up something like the Media Archaeology Lab here at Mary Washington, and if I do, access will be an important part of its mission.
Trevor: You say that ROM TXT is doing what it does “for beauty.” What about the tweets it kicks out is beautiful? Do you think it would look different if it was done “for knowledge”?
Zach: Well, I guess I just feel “knowledge” would be too presumptuous. There’s a lot more pressure on it to be profound or interesting if I say it’s meant to be those things. So “beauty” is sort of my cop-out. I alluded earlier to how ROM_TXT came about as a byproduct of research, and occasionally it is still useful for that, but really the tweets I enjoy most are those that make me chuckle or raise an eyebrow, so I think that’s a set of responses more aligned with the aesthetic than the analytical.
Right now it’s got about 150 followers, which I actually think is quite a lot for a bot that tweets mostly nonsense every 5 minutes. I mean, that’s a lot to ask of a follower, but some people do seem to enjoy it. Personally, I follow it because its relentless stream of fresh weirdness keeps my timeline less predictable.
Trevor: A strand of work in the digital humanities has focused on the “deformance of texts.” One example that comes to mind is Mark Sample’s piece Hacking the Accident, an algorithmically altered version of Hacking the Academy intended as “a legitimate mode of scholarship.” Do you think ROM TXT can be thought of as a deformance of the collection of ROMs you are working from? If so, I would be curious to get a sense of what kinds of things we learn from the deformance.
Zach: I hadn’t thought of this as deformance until you suggested it, but I’ve experimented with that idea in my teaching and in making various silly web things. I absolutely do think deformance is a legitimate form of scholarship, but I worry it’s often the kind of thing that loses critical power the more one explains what it’s supposed to mean. The relationship between ROM_TXT and the ROMs it uses is definitely deformative in the sense that I’m not using them at all as intended, but I’m honestly not sure I can say what we learn from it. If I have to come up with something, I guess ROM_TXT highlights some invisible complexities of games and surfaces some of the heterogeneous discourses implicated in their production. Or maybe not. Maybe it’s just a bit of weirdness on Twitter.