One survey participant suggested “creating new art” as a possible use of web archives. It’s an idea I ran with, using skills picked up during the Archives Unleashed Datathon I attended, as well as those I’ve developed in the Programming for Cultural Heritage course at Pratt.
The output of this stage of the project was largely inspired by a meeting I had in the fall with Ashley Hinshaw, the current NYARC Kress Fellow, while she was stationed at the Museum of Modern Art library. For my visit Hinshaw brought out several books in the Library of the Printed Web collection, which the library acquired in 2017. Assembled by artist Paul Soulellis, the collection features artist’s books that use web content in innovative and provocative ways, pushing readers to question ideas about appropriation, copyright, and the vast, sometimes messy, expanses of content on the web (Soulellis, 2014).
Transforming digital information into a physical form has also gained some traction in the data science field. Researchers have coined the term “data physicalizations” for these objects, and they believe the benefits of creating physical manifestations of data visualizations include greater accessibility, improvements to cognition and learning, and more (Jansen et al., 2015). The List of Physical Visualizations and Related Artifacts is an excellent catalog of data physicalization projects (as well as some great pre-digital examples).
My goal for this part of the project was to translate a data analysis of NYARC’s Brooklyn Museum web archive collection into a physicalization in the spirit of the Library of the Printed Web and the data physicalizations displayed in the catalog above, while challenging the nature of digital and craft work and the assumptive roles gender still plays in them: Who designs and manipulates the programs to archive the web? Who does the archiving? Who performs research using web archives?
Because I’d been QAing the Brooklyn Museum web archive during my fellowship, I’d grown very familiar with the site, so it made sense to work with it for this part of the project. The first step was to extract data from the web archive utilizing the Archives Unleashed Cloud, an open-source analysis tool that was designed to help researchers analyze web archives. One current drawback is that it’s only accessible to those with an existing Archive-It account (though some datasets drawn from web archives are available openly to researchers, thanks to a team made up of some of the same AU folks). I initiated an analysis of the collection, which “triggers an Apache Spark job and uses AUT to create a basic set of derivatives” (Archives Unleashed Project, 2019). Those derivatives include, among others, files that can be viewed and manipulated using the network analysis tool Gephi, which is what I planned to work with.
One is a GEXF file that has been processed by the AU Cloud tool to apply a layout to the hyperlink network diagram that visualizes the URL connections in the Brooklyn Museum web archive collection. Shown below (in a static version, unlike a version you’d be able to explore by interacting with it), you’ll notice a huge mass of purple dots — those represent the huge number of Tumblr links present in the archive. Digging into that a bit more, I found that the museum does have a Tumblr page, which is still active as of this writing, but I saw from the crawl reports that it was only captured for a brief period: once initially in 2015, and then weekly from November 2016 to September 2017. NYARC has captured some social media presences related to the institutions they’re collecting, so it’s not a surprise to see this in there. But the vast number of Tumblr links isn’t particularly useful to this analysis, and muddies up some of the relationship information that we might be able to draw from it.
I decided to try to limit the data to exclude the Tumblr URLs. First working with the GEXF file, and then trying it out with the other Gephi derivative (a GraphML file that hadn’t had any layouts applied to it), I found I could use some training in Gephi. Archives Unleashed has a useful tutorial, but it unfortunately doesn’t cover what I was looking for. I played with regular expressions and applied a number of filters, but I never determined how to filter out those URLs. I did enjoy watching the Yifan Hu layout being applied to the data, though.
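For anyone who hits the same wall, one possible workaround is to prune the GraphML file before it ever reaches Gephi. The following is a minimal sketch, not the approach used in this project. It assumes that each node's domain name appears either in its `id` or in one of its `<data>` values, which may not match the actual structure of the AU Cloud derivative. The function name `drop_matching_nodes` and the default `needle` are made up for illustration.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"
NS = {"g": GRAPHML_NS}


def drop_matching_nodes(graphml_text, needle="tumblr.com"):
    """Return GraphML text without the nodes (and their edges) that mention `needle`."""
    # Keep the default namespace on output so the result stays plain GraphML.
    ET.register_namespace("", GRAPHML_NS)
    root = ET.fromstring(graphml_text)

    for graph in root.findall("g:graph", NS):
        dropped = set()

        # Assumption: the domain lives in the node id or in a <data> value.
        for node in graph.findall("g:node", NS):
            values = [node.get("id", "")]
            values += [d.text or "" for d in node.findall("g:data", NS)]
            if any(needle in value.lower() for value in values):
                dropped.add(node.get("id"))
                graph.remove(node)

        # Remove every edge that pointed to or from a dropped node.
        for edge in graph.findall("g:edge", NS):
            if edge.get("source") in dropped or edge.get("target") in dropped:
                graph.remove(edge)

    return ET.tostring(root, encoding="unicode")
```

The filtered text could then be saved to a new file and opened in Gephi like any other GraphML file, leaving the original derivative untouched.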
It’s a tad melodramatic, but watching all the nodes unfold and expand, it was like watching the recent past manifest on the screen, heavy and black with shades of grey at the edges. We’re dealing with enormous changes while unthinkable tragedies play out all around us. While existing in that, playing around in Gephi feels incredibly inconsequential. So that would have to be it.
Now on to the physicalizing. The pixelated nature of cross stitch was ideal for this project: Sewing tiny Xs into a graph of fabric is like zooming in too far on the image above. Cross stitch can embody the digital nature of this object better than most other crafts, so that was my choice. I found a program, developed by Paul Reed, that uses Python to generate a cross-stitch pattern from any .jpg file. The code needed a small update for Python 3, and the package was missing a .csv file that contained thread colors converted to RGB colors; I made the update and adapted this chart to work for the program (the forked project is available here). And then I ran the program using the image shown above.
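The reason the thread-color file matters is that a pattern generator has to decide which thread stands in for each pixel. The sketch below shows one common way to do that, matching each pixel to the closest palette color by squared distance in RGB space. It is an illustration of the general idea only and may not reflect how Reed's program actually works. The function names and the three-color palette are placeholders, not real thread codes.

```python
def nearest_thread(rgb, palette):
    """Return the name of the palette color closest to `rgb` in RGB space."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, palette[name]))

    return min(palette, key=squared_distance)


def to_pattern(pixels, palette):
    """Map a grid of RGB pixels to a grid of thread names, one stitch per pixel."""
    return [[nearest_thread(pixel, palette) for pixel in row] for row in pixels]


# Placeholder palette: a real one would be loaded from the thread-color .csv.
example_palette = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "purple": (128, 0, 128),
}
```

Calling `to_pattern` on a small grid of pixels with `example_palette` yields a grid of thread names of the same shape, which is essentially what a printed cross-stitch chart encodes with symbols.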
I wasn’t able to recreate it perfectly (purchasing crafting supplies during a pandemic means accepting, gratefully, whatever you can get), but it’s close. Most of all, though, during a stressful time, it was both distracting and comforting to have something to do that didn’t involve staring at a screen — even though, of course, I was fully aware that it was inherently connected to screens.
The result is a physical representation of digital data that was drawn from web content. It’s a small bit of art that’s something like the reverse of the digitization projects that so many cultural institutions have been undertaking in the past several years. It’s also a reminder that there are physical aspects to digital objects, and that digital collections such as web archives require practical skills and labor, akin to learning a craft, to collect, preserve, and provide access to them.