I spent my fellowship working with the NYARC web archive collections. Web archives collect the content of a live website, preserve it, and provide access to it at a later date, ideally providing the same experience that a user would have had when viewing and interacting with the live website. This is different from a digital archive, which collects, preserves, and provides access to digitized versions of physical materials (like printed photographs that have been scanned) or other born-digital materials (like digital photographs or emails).
Various web crawlers exist to gather the underlying code and other documents of websites into the WARC file format, an archival format that can be rendered later using a playback tool. The Internet Archive’s Wayback Machine is probably the best known of these tools, allowing users to view past versions of websites that have been captured by crawlers over time. While it’s possible to crawl and capture websites using open-source software, many organizations, including NYARC, choose to employ the Internet Archive’s Archive-It service, which provides user-friendly administration of crawls, a full-text searchable interface for the public to use collections, and technical support.
Much of my fellowship was spent QAing web archives in the NYARC collection. I spent most of my time with the Brooklyn Museum website, which is a fairly large, complex site. Over time, I’ve addressed crawler-trap issues that may have been impacting the crawler’s ability to capture tens of thousands of images, identified cases where using the Brozzler crawler is preferable to Heritrix, and dealt with crawl delays that were impacting the efficiency of archiving (with a great deal of gratitude to both Sumitra Duncan at the Frick and Karl-Rainer Blumenthal at the Internet Archive). Combined with software updates, the result is a much different crawl compared to what NYARC was working with when I began the fellowship in August.
Professional Development in the Use of Web Archives
In addition to the on-site and remote web archiving work, I was also fortunate to attend two web archiving workshops over the past year, both of which allowed me to investigate my own research interests in the scholarly use of web archives and learn more about the tools available to researchers who are interested in using web archives in their work.
The first was a Continuing Education to Advance Web Archiving (CEDWARC) workshop, held at George Washington University on October 28, 2019. The schedule was packed with presentations and labs about some of the tools available for working with web archives, which was invigorating, if a bit overwhelming (or, as Zhiwu Xie of Virginia Tech, one of the investigators on the CEDWARC project, said at the end of the day: “Right now it may feel like you’re drinking from a firehose.”). My main takeaways were the two common points that emerged from several participants during the wrap-up segment: parallel needs for a community of practice and for outreach that leads to increased use of web archives by researchers.
The second was the Archives Unleashed New York Datathon on March 26-27, 2020. Originally scheduled to take place at Columbia University, the organizers quickly and conscientiously moved the event online because of COVID-19. (In the grand scheme of everything that’s happening in New York City and around the world during this pandemic, moving this workshop online was not a tragedy. But I mourn these missed opportunities to connect in real life; virtual networking isn’t always the same — and this is coming from someone who has irl LiveJournal friends.) Still, this was some good lemonade. It was a great chance to get hands-on experience with large web archives datasets, which helped inform a part of this practicum project, and my team put together a pretty cool project (in which I learned what red can refer to in Spanish).