Restoring

So, you have a website that's gone, for whatever reason; there's a copy in the Wayback Machine, and now you want to get it all back, preferably without clicking through every single page.

If you're lucky, the grab was done by Archive Team and the WARC file will be available in the Archive Team collection, where you can just download the whole thing and then extract its contents with one of the WARC tools, such as warctozip.
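
As a rough sketch of what that extraction step looks like, here is a minimal Python example using the warcio library (a different tool than warctozip, used here only for illustration; the filename site.warc.gz is a placeholder):

# Minimal sketch: write out the response payloads stored in a WARC,
# using the warcio library (pip install warcio). "site.warc.gz" is a
# placeholder filename, not a real Archive Team item.
import os
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

with open('site.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Only 'response' records carry the captured page content.
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        path = urlparse(url).path.lstrip('/')
        if not path or path.endswith('/'):
            path += 'index.html'
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
        with open(path, 'wb') as out:
            out.write(record.content_stream().read())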

If it wasn't an Archive Team grab and the pages were just scooped up as part of normal Wayback Machine operation, things are a bit more difficult: archive.org does not allow you to download those crawls directly (and the files you want would be spread across many crawls anyway, since the Wayback Machine does not grab just one site at a time).

The Wayback Machine doesn't intentionally try to prevent you from downloading, but the usual method of recursive downloading with wget using -np ("no parent") does not work, because the Wayback Machine date-codes URLs based on the time of crawl, so pages that appear to be in the same directory actually are not. E.g.:

http://web.archive.org/web/20140208214426/http://archiveteam.org/index.php?title=Main_Page

links to

http://web.archive.org/web/20140215063724/http://archiveteam.org/index.php?title=Who_We_Are

which has a different date code, so it fails the "no parent" test.
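
To make the failure concrete, here is a minimal Python sketch that splits the two Wayback URLs above into their date code and original URL (the regular expression is an assumption based on the URL layout shown here):

# Sketch: split Wayback Machine URLs into (date code, original URL).
# The two captures above point at the same site, but their different
# 14-digit date codes put them in different "directories" as far as
# wget's no-parent check is concerned.
import re

WAYBACK = re.compile(r'^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$')

captures = [
    'http://web.archive.org/web/20140208214426/http://archiveteam.org/index.php?title=Main_Page',
    'http://web.archive.org/web/20140215063724/http://archiveteam.org/index.php?title=Who_We_Are',
]
for url in captures:
    datecode, original = WAYBACK.match(url).groups()
    print(datecode, original)
# 20140208214426 http://archiveteam.org/index.php?title=Main_Page
# 20140215063724 http://archiveteam.org/index.php?title=Who_We_Are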

Tools to use

  • Warrick - The main site was at [1] but seems to be down. Downloads are available at [2]. Your mileage may vary: it's quite slow to run, and the feature for grabbing from the Google/Yahoo/Bing caches doesn't seem to work. But it's currently probably the best option.
  • Wayback Downloader - effectiveness unknown; costs $15.

Tricks

This is undocumented, but if you retrieve a page with id_ appended after the date code, you will get the original document, with all the Wayback scripts, header material, and link rewriting removed. This is useful for restoring a single page or when writing a tool to retrieve a whole site:

http://web.archive.org/web/20051001001126id_/http://www.archive.org/
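
For example, a minimal Python sketch that fetches the unmodified original bytes of a capture using the id_ trick (the date code and target URL are taken from the example above; the output filename is a placeholder):

# Sketch: retrieve the original, unrewritten document for a capture
# by inserting "id_" after the 14-digit date code in the Wayback URL.
from urllib.request import urlopen

datecode = '20051001001126'            # capture timestamp from the example
target = 'http://www.archive.org/'     # the originally archived URL

url = f'http://web.archive.org/web/{datecode}id_/{target}'
with urlopen(url) as resp:
    original = resp.read()             # raw bytes, no Wayback rewriting

with open('index.html', 'wb') as out:  # placeholder output filename
    out.write(original)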