Fellowship Overview

Web Archiving

I spent my fellowship working with the NYARC web archive collections. Web archives collect the content of a live website, preserve it, and provide access to it at a later date, ideally providing the same experience that a user would have had when viewing and interacting with the live website. This is different from a digital archive, which collects, preserves, and provides access to digitized versions of physical materials (like printed photographs that have been scanned) or other born-digital materials (like digital photographs or emails).

Check out this glossary of terms to decipher some of the terminology that follows.

Different web crawling software exists to gather the underlying code and other documents of websites into the WARC file format, an archival format that can then be rendered later using a playback tool. The Internet Archive’s Wayback Machine is probably the best known of these tools, allowing users to view past versions of websites that have been captured by crawlers over time. While it’s possible to crawl and capture websites using open-source software, many organizations, including NYARC, choose to employ the Internet Archive’s Archive-It service, which provides user-friendly administration of crawls, a full-text searchable interface for the public to use collections, and technical support.
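To make the WARC idea a little more concrete, here is a minimal sketch of what one record roughly looks like: a block of named header fields, a blank line, and then the captured HTTP response. This is a simplified illustration written with the Python standard library, not how Archive-It or any particular crawler writes its files. The function names (`build_warc_record`, `read_warc_headers`) and the example URL are made up for the sketch, and real WARC files carry more fields and record types than shown here.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, http_payload: bytes) -> bytes:
    """Assemble one simplified WARC 'response' record (illustrative only)."""
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    # Header block, blank line, captured payload, then two CRLFs to end the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_payload + b"\r\n\r\n"


def read_warc_headers(record: bytes) -> dict:
    """Parse the header block of a single record into a dictionary."""
    head, _, _ = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = {"version": lines[0]}
    for line in lines[1:]:
        # Split on the first colon only, since values such as URLs contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


# Hypothetical captured response for a made-up URL.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
record = build_warc_record("https://example.org/", payload)
print(read_warc_headers(record)["WARC-Target-URI"])
```

A playback tool does the reverse of the first function: it looks up the record for a requested URL and date, then serves the stored response back to the browser.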

While Archive-It has a lot of advantages for subscribers, web archiving is still a work in progress. Capturing static HTML documents is generally easier than capturing dynamic, JavaScript-based websites, so web crawlers encounter challenges regularly. For instance, websites that feature a more straightforward HTML design, like RIHA Journal, will be captured very well, so users interacting with an archived version of the site will have a nearly identical experience to those who interacted with the live site. Other times, crawlers may not capture all of the pieces required to make that experience just right. There are a lot of reasons for this, from time or data limits that web archivists apply to crawls, to the challenges that crawlers encounter with dynamic web content. As a result, quality assurance (QA) is required to ensure that web archives represent the live site as closely as possible.

A 2019 web archive capture of the Brooklyn Museum website (above) compared to the live site (below) shows the crawler missed some thumbnail images.
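One way to picture this kind of QA check is to list the images a live page references and see which ones never made it into the crawl. The sketch below is only an illustration of that idea, not the workflow I used or a feature of Archive-It. It assumes you already have the live page's HTML and a set of URLs the crawl captured (for example, exported from a crawl report). The names `uncaptured_images` and `captured_urls` and the example.org URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def uncaptured_images(live_html: str, page_url: str, captured_urls: set) -> list:
    """Return image URLs referenced by the live page but absent from the crawl."""
    collector = ImageCollector()
    collector.feed(live_html)
    # Resolve relative paths against the page URL before comparing.
    absolute = {urljoin(page_url, src) for src in collector.sources}
    return sorted(absolute - captured_urls)


# Made-up page and crawl data for illustration.
live_html = '<img src="/thumbs/a.jpg"><img src="/thumbs/b.jpg">'
captured_urls = {"https://example.org/thumbs/a.jpg"}
print(uncaptured_images(live_html, "https://example.org/collection/", captured_urls))
```

This only sees images written directly into the HTML. Thumbnails loaded by JavaScript after the page opens would not appear in the list, which is part of why dynamic sites are harder to check and to capture.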

Much of my fellowship was spent QAing web archives in the NYARC collection. I spent most of my time with the Brooklyn Museum website, which is a fairly large, complex site. Over time, I’ve addressed crawler-trap issues that may have been impacting the crawler’s ability to capture tens of thousands of images, determined instances where using the Brozzler crawler is preferable to Heritrix, and dealt with crawl delays that were impacting the efficiency of archiving (with a great deal of gratitude to both Sumitra Duncan at the Frick and Karl-Rainer Blumenthal at the Internet Archive). Combined with software updates, the result is a much different crawl compared to what NYARC was working with when I began the fellowship in August.


Professional Development in the Use of Web Archives

In addition to the on-site and remote web archiving work, I was also fortunate to attend two web archiving workshops over the past year, both of which allowed me to investigate my own research interests in the scholarly use of web archives and learn more about the tools available to researchers who are interested in using web archives in their work.

The first was a Continuing Education to Advance Web Archiving (CEDWARC) workshop, held at George Washington University on October 28, 2019. The schedule was packed with presentations and labs about some of the tools available for working with web archives, which was invigorating, if a bit overwhelming (or, as Zhiwu Xie of Virginia Tech, one of the investigators on the CEDWARC project, said at the end of the day: “Right now it may feel like you’re drinking from a firehose”). My main takeaways were the two common points that emerged from several participants during the wrap-up segment: parallel needs for a community of practice and for outreach that leads to increased use of web archives by researchers.

During the end-of-day wrap-up, CEDWARC presenters discussed the challenges researchers face in using web archives.

The second was the Archives Unleashed New York Datathon on March 26-27, 2020. Originally scheduled to take place at Columbia University, the organizers quickly and conscientiously moved the event online because of COVID-19. (In the grand scheme of everything that’s happening in New York City and around the world during this pandemic, moving this workshop online was not a tragedy. But I mourn these missed opportunities to connect in real life; virtual networking isn’t always the same — and this is coming from someone who has irl LiveJournal friends.) Still, this was some good lemonade. It was a great chance to get hands-on experience with large web archives datasets, which helped inform a part of this practicum project, and my team put together a pretty cool project (in which I learned what red can refer to in Spanish).

Using AntConc, I analyzed derivatives from the Latin American and Caribbean Contemporary Art Web Archive to see how color terms are represented on the websites.
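For readers who haven't used a concordance tool, the basic idea behind this kind of analysis is a word-frequency count restricted to a list of terms. The sketch below is a rough stand-in written in Python, not what AntConc does internally and not the code my team used. The term list and the sample sentence are invented for illustration. It also shows the wrinkle I ran into: "red" is the Spanish word for "network," so a raw count of "red" in Spanish-language text can pick up hits that have nothing to do with color.

```python
import re
from collections import Counter

# Hypothetical list of color terms; a real project would choose its own.
COLOR_TERMS = {"rojo", "roja", "azul", "verde", "amarillo", "negro", "blanco", "red"}


def count_color_terms(text: str, terms=COLOR_TERMS) -> Counter:
    """Count how often each listed term appears as a whole word in the text."""
    # Lowercase, then pull out runs of letters (accented letters included).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(token for token in tokens if token in terms)


# Invented sample sentence: "red" here means "network," not the color.
sample = "La red de artistas usa rojo, azul y rojo intenso."
print(count_color_terms(sample))
```

Telling the two senses of "red" apart takes context, which is where a concordance view (each hit shown with the words around it) earns its keep over a bare count.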