InfoGuides: Archiving Digital Projects: Archive the Interface

Goal

Provide a snapshot of the look and feel of a digital project at a particular moment in time. This snapshot enables future visitors to interact with the site even if it isn’t “live” anymore.

How

Use a web crawler (see the list under Tools) to create a .warc (web archive) file from a set of web pages. This .warc file contains the HTML, CSS, JavaScript, and media files (images, video, sound, etc.) that were delivered to the browser at the time of capture.
Upload the zipped .warc file to an institutional repository for preservation.
Submit the site's URL to the Internet Archive so that an archived copy of the site is also available there.
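
The compression step can be sketched in a few lines of Python. This is only an illustration, not a required workflow: the function names are made up for this example, and the checksum is an optional extra that some repositories ask for so the deposit can be verified later. Note that this compresses the file as a whole, which suits a deposit copy; playback tools generally prefer the compressed output written by the crawler itself.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def gzip_warc(warc_path: Path) -> Path:
    """Compress a .warc file to .warc.gz next to the original; return the new path."""
    gz_path = warc_path.with_name(warc_path.name + ".gz")
    with warc_path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For a capture saved as site.warc (a hypothetical filename), gzip_warc(Path("site.warc")) would write site.warc.gz alongside it, and sha256_of on the result gives a fixity value to record with the deposit.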

Tools

Webrecorder: https://webrecorder.io

Webrecorder is a project of Rhizome and is funded by the Andrew W. Mellon Foundation. Using the Webrecorder interface, users “record” each web page and interaction they wish to capture. The product came out of beta in 2016 and is an excellent solution for smaller websites, or for sites that rely heavily on interactive JavaScript or embedded elements to create the web experience.

Internet Archive: https://archive.org/web/

The Internet Archive is the largest public archive of the web and the central place for finding archived versions of websites. Including one’s site in the Internet Archive makes it available to visitors who use tools such as the Internet Archive plugin for Firefox.
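
For those who prefer to script the submission, the sketch below builds the two Internet Archive URLs involved: one that asks the archive to capture a page and one that asks whether a capture already exists. The endpoint patterns reflect the publicly documented services as best understood at the time of writing and may change; the function names and the example.org address are placeholders invented for this illustration.

```python
from typing import Optional
from urllib.parse import urlencode


def save_page_now_url(site_url: str) -> str:
    """URL that requests a fresh capture of site_url (assumed endpoint pattern)."""
    return "https://web.archive.org/save/" + site_url


def availability_query_url(site_url: str) -> str:
    """URL that asks the availability service whether a capture exists."""
    return "https://archive.org/wayback/available?" + urlencode({"url": site_url})


def closest_snapshot(payload: dict) -> Optional[str]:
    """Pull the closest snapshot URL out of a decoded availability response.

    Assumes the JSON shape {"archived_snapshots": {"closest": {...}}};
    returns None when no capture is reported.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

Opening the first URL in a browser is enough to request a capture. The second returns JSON, which can be fetched with urllib.request and decoded with the json module before being passed to closest_snapshot.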

Heritrix: https://webarchive.jira.com/wiki/display/Heritrix

Heritrix is the Java-based crawler used by the Internet Archive to crawl the web and capture sites. It is useful for large sites and for sites built mainly from standard HTML and CSS.

Web Archive Player: https://github.com/ikreymer/webarchiveplayer

Web Archive Player is a desktop tool for “playing back” .warc files. It comes with easy installation files for Mac and Windows.
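
Before opening a capture in a playback tool, it can be useful to check which pages it actually contains. The following is a minimal sketch that reads the record headers of an uncompressed .warc file using only the Python standard library. It assumes well-formed records with single-line headers and is not a substitute for a full WARC library; the function name is invented for this example.

```python
from typing import BinaryIO, Dict, Iterator


def iter_warc_headers(stream: BinaryIO) -> Iterator[Dict[str, str]]:
    """Yield the header fields of each WARC record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip over the record body so the next read starts at the separator.
        stream.read(int(headers.get("Content-Length", "0")))
        yield headers
```

Opening a file in binary mode and printing the WARC-Target-URI field of each record gives a quick inventory of the captured URLs. For a .warc.gz file, gzip.open in "rb" mode can supply the stream instead.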

Things to consider

Third-party crawlers, particularly the Internet Archive, will respect the rules in your robots.txt file. If you blocked crawling during development, be sure to update your robots.txt file once the project is complete so that the site can be crawled.
As browsers update, archived versions of the site may display differently and possibly break. If particular views are important to the argument of a digital project, it is wise to capture them in a “static” format such as a screenshot.
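
The robots.txt point above can be checked before requesting a crawl. The sketch below uses Python's standard urllib.robotparser module to test whether a given crawler would be allowed to fetch a page under a given set of rules. The function name is invented for this example, and the user-agent string "ia_archiver" is an assumption used for illustration; consult the crawler's own documentation for the name it actually announces.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A rule set often left over from development: block every crawler.
development_rules = "User-agent: *\nDisallow: /\n"

# A rule set that allows everything (an empty Disallow blocks nothing).
published_rules = "User-agent: *\nDisallow:\n"
```

With the development rules, crawler_allowed(development_rules, "ia_archiver", "https://example.org/project/") reports that the page is blocked; with the published rules the same call reports that it is allowed.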