George Mason University Infoguides | University Libraries

Archiving Digital Projects

Guidelines for how to archive a digital project. This guide walks through how the interface, data, and code layers of the digital project can be captured, with recommended tools and resources.

Goal

Provide a snapshot of the look and feel of a digital project at a particular moment in time. This snapshot enables future visitors to interact with the site even if it isn’t “live” anymore.

How

  • Use a web crawler (see the tools listed below) to create a .warc (web archive) file from a set of web pages. This .warc file contains the HTML, CSS, JavaScript, and media files (images, video, sound, etc.) that were delivered to the browser at the time of capture.

  • Upload the zipped .warc file to an institutional repository for preservation.

  • Submit the site link to the Internet Archive to ensure that an archive of the site is also available through the Internet Archive.
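The upload step above can be prepared with a short script. This is a minimal sketch, assuming that "zipped" can mean gzip compression and that the repository accepts a .warc.gz file; the function names `compress_warc` and `sha256_of` are hypothetical and not part of any tool named in this guide. The checksum lets you confirm later that the deposited file is unchanged.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def compress_warc(warc_path):
    """Write a gzip-compressed copy next to the original and return its path."""
    source = Path(warc_path)
    target = source.with_name(source.name + ".gz")  # project.warc -> project.warc.gz
    with source.open("rb") as raw, gzip.open(target, "wb") as packed:
        shutil.copyfileobj(raw, packed)
    return target


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of(compress_warc("project.warc"))` would return a checksum that could be recorded alongside the repository deposit (the file name here is a placeholder).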

Tools

Webrecorder: https://webrecorder.io

  • Webrecorder is a product of the Rhizome project and is funded by the Mellon Foundation. Using the Webrecorder interface, users can “record” each web page and interaction they wish to capture. The product came out of beta in 2016 and is an excellent solution for smaller websites or for sites that rely heavily on interactive JavaScript or embedded elements to create the web experience.

 

Internet Archive: https://archive.org/web/

  • The Internet Archive is the main entity archiving the web and is the central place for finding archived versions of web domains. Including one’s site in the Internet Archive will make it available to visitors using tools such as the Internet Archive plugin for Firefox.
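Submitting a site to the Internet Archive can also be scripted. The sketch below assumes the Wayback Machine's "Save Page Now" address, which takes the target URL appended to `https://web.archive.org/save/`; the function names and the User-Agent string are hypothetical, and the service may apply rate limits or change its behavior, so treat this as an illustration and not a definitive client.

```python
import urllib.request

# Assumed "Save Page Now" prefix; the page to capture is appended to it.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_page_now_url(site_url):
    """Build the capture-request address for one page."""
    return SAVE_PREFIX + site_url.strip()


def request_capture(site_url, timeout=60):
    """Ask the Wayback Machine to capture a page and return the HTTP status code."""
    request = urllib.request.Request(
        save_page_now_url(site_url),
        headers={"User-Agent": "digital-project-archiving-example"},  # hypothetical value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `request_capture("https://example.org/project/")` performs a live network request, so it is best run once per page and not in a tight loop.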

 

Heritrix: https://webarchive.jira.com/wiki/display/Heritrix

  • Heritrix is the Java-based crawler used by the Internet Archive to crawl the web and capture sites. It is useful for large sites and sites with standard HTML and CSS elements.

 

Web Archive Player: https://github.com/ikreymer/webarchiveplayer

  • Web Archive Player is a desktop tool for “playing back” .warc files. It comes with easy installation files for Mac and Windows.
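Before relying on playback, it can help to check which pages a .warc file actually contains. The following is a rough sketch using only the Python standard library: it scans the file for `WARC-Target-URI` header lines, which WARC records use to name the captured address. It does not parse records properly, so a payload line that happens to start with the same text would be counted too; the function name `list_captured_urls` is hypothetical.

```python
import gzip


def list_captured_urls(warc_path):
    """Return the WARC-Target-URI values found in a .warc or .warc.gz file."""
    opener = gzip.open if str(warc_path).endswith(".gz") else open
    prefix = b"warc-target-uri:"
    urls = []
    with opener(warc_path, "rb") as stream:
        for line in stream:
            # Header names are matched case-insensitively.
            if line[: len(prefix)].lower() == prefix:
                value = line[len(prefix):].strip().decode("utf-8", "replace")
                # Some files wrap the URI in angle brackets.
                urls.append(value.strip("<>"))
    return urls
```

If an important page is missing from the returned list, it was probably not captured and the crawl or recording should be repeated.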

Things to consider

  • Third-party crawlers, particularly the Internet Archive, will respect the information found in your robots.txt file. If you have blocked crawling during development, be sure to update your robots.txt file once the project is completed so that it can be crawled.
  • As browsers update, the display of older versions of the site will change and possibly break. If there are particular views that are important to the argument of a digital project, it is wise to capture these in a “static” format such as a screenshot.
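The robots.txt point above can be checked before requesting a crawl. This minimal sketch uses Python's standard `urllib.robotparser` module to test whether a set of rules lets a crawler fetch a given page; the two rule sets and the function name `is_crawlable` are illustrative assumptions, not taken from any particular project.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Typical rules while a site is under development: block every crawler.
DEVELOPMENT_RULES = "User-agent: *\nDisallow: /\n"

# Rules after launch: an empty Disallow value permits everything.
FINISHED_RULES = "User-agent: *\nDisallow:\n"
```

With these rules, `is_crawlable(DEVELOPMENT_RULES, "https://example.org/project/")` is False and the same call with `FINISHED_RULES` is True, which is the change to make once the project is complete.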