Internet Archive

From Archiveteam
Revision as of 18:05, 6 March 2015 by Bzc6p (talk | contribs) (mentioned the ongoing discussion about backing up the archive)
Internet Archive main page on 2010-12-21
URL http://www.archive.org
Status Online!
Archiving status Upcoming...
Archiving type Unknown
IRC channel #archiveteam-bs (on hackint)

The Internet Archive is a non-profit digital library with the stated mission: "universal access to all knowledge". It stores over 400 billion webpages, captured at different dates and times for historical purposes, and makes them available through the Wayback Machine, arguably an archivist's wet dream. The archive.org website also archives books, music, videos, and software.

Mirrors

There are currently two mirrors of the Internet Archive collection: the official mirror available at archive.org, and a second mirror at Bibliotheca Alexandrina. Both seem to be up and stable.

Raw Numbers

December 2010:

  • 4 data centers, 1,300 nodes, 11,000 spinning disks
  • Wayback Machine: 2.4 PetaBytes
  • Books/Music/Video Collections: 1.7 PetaBytes
  • Total used storage: 5.8 PetaBytes

August 2014:

  • 4 data centers, 550 nodes, 20,000 spinning disks
  • Wayback Machine: 9.6 PetaBytes
  • Books/Music/Video Collections: 9.8 PetaBytes
  • Unique data: 18.5 PetaBytes
  • Total used storage: 50 PetaBytes

Uploading to archive.org

Upload any content you manage to preserve! Registering takes a minute.

Tools:

  • For quick one-shot webpage archiving, use the Wayback Machine's "Save Page Now" tool.
  • S3 interface (for direct usage with curl, or indirect usage through the tool of your choice).
  • The internetarchive Python tool is one such tool.
  • Handy script for mass upload with automatic error checking and retry.
  • Torrent upload, useful if you need resume (for huge files or because your bandwidth is insufficient for upload in one go):
    • Just create the item, make a torrent with your files in it, name it like the item, and upload it to the item.
    • archive.org will connect to you and other peers via a Transmission daemon and keep downloading all the contents till done;
    • For a command line tool you can use e.g. mktorrent or buildtorrent, example: mktorrent -a udp://tracker.publicbt.com:80/announce -a udp://tracker.openbittorrent.com:80 -a udp://tracker.ccc.de:80 -a udp://tracker.istole.it:80 -a http://tracker.publicbt.com:80/announce -a http://tracker.openbittorrent.com/announce "DIRECTORYTOUPLOAD" ;
    • You can then seed the torrent with one of the many graphical clients (e.g. Transmission) or on the command line (Transmission and rtorrent are the most popular; btdownloadcurses reportedly doesn't work with udp trackers.)
    • archive.org will stop the download if the torrent stalls for some time and add a file to your item called "resume.tar.gz", which contains whatever data was downloaded. To resume, delete the empty file called IDENTIFIER_torrent.txt; then, resume the download by re-deriving the item (you can do that from the Item Manager.) Make sure that there are online peers with the data before re-deriving and don't delete the torrent file from the item.
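
The mktorrent invocation from the torrent steps above can also be assembled programmatically, which helps when many items need torrents. The following is a minimal sketch, not an official tool: the function name and the example identifier are hypothetical, the tracker URLs are copied unchanged from the example command above, and the `-o` option (output file name) is assumed to behave as in current mktorrent releases.

```python
import shlex

# Tracker URLs copied from the example command in the text above.
TRACKERS = [
    "udp://tracker.publicbt.com:80/announce",
    "udp://tracker.openbittorrent.com:80",
    "udp://tracker.ccc.de:80",
    "udp://tracker.istole.it:80",
    "http://tracker.publicbt.com:80/announce",
    "http://tracker.openbittorrent.com/announce",
]


def build_mktorrent_command(directory, identifier, trackers=TRACKERS):
    """Return the argv list for mktorrent (hypothetical helper).

    `identifier` is the archive.org item name; the torrent file is named
    after it, as the upload steps require.
    """
    argv = ["mktorrent"]
    for tracker in trackers:
        argv += ["-a", tracker]
    # Assumption: -o sets the output file name of the generated torrent.
    argv += ["-o", f"{identifier}.torrent", directory]
    return argv


if __name__ == "__main__":
    cmd = build_mktorrent_command("DIRECTORYTOUPLOAD", "my-item-identifier")
    # Print a shell-safe version; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a string avoids quoting problems with directory names that contain spaces.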

Don't use FTP upload; try to keep your items below 400 GiB in size; add plenty of metadata.
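
As a rough illustration of the size and metadata advice, the sketch below checks a directory against the 400 GiB guideline and prepares the URL and headers for an upload through the S3 interface. It is a sketch under assumptions, not a definitive client: the endpoint `s3.us.archive.org`, the `LOW access:secret` authorization scheme, and the `x-amz-auto-make-bucket` and `x-archive-meta-*` header names should be verified against the current archive.org S3 documentation, and all function names and example values are hypothetical.

```python
import os
from urllib.parse import quote

S3_ENDPOINT = "https://s3.us.archive.org"  # assumed endpoint of the S3 interface
MAX_ITEM_BYTES = 400 * 1024**3             # 400 GiB guideline from the text


def directory_size(path):
    """Sum the sizes of all regular files below `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def fits_item_limit(size_bytes):
    """True if an item of this size stays below the 400 GiB guideline."""
    return size_bytes < MAX_ITEM_BYTES


def build_upload_request(identifier, filename, access_key, secret_key, metadata):
    """Return (url, headers) for an HTTP PUT of one file into an item."""
    url = f"{S3_ENDPOINT}/{quote(identifier)}/{quote(filename)}"
    headers = {
        # Assumed authorization scheme of the S3 interface.
        "authorization": f"LOW {access_key}:{secret_key}",
        # Assumed header asking archive.org to create the item if missing.
        "x-amz-auto-make-bucket": "1",
    }
    # Each metadata field becomes one x-archive-meta-<field> header.
    for key, value in metadata.items():
        headers[f"x-archive-meta-{key}"] = str(value)
    return url, headers


if __name__ == "__main__":
    url, headers = build_upload_request(
        "my-item-identifier", "site.warc.gz", "ACCESS", "SECRET",
        {"mediatype": "web", "title": "Example site grab"},
    )
    print(url)
    for name, value in headers.items():
        print(f"{name}: {value}")
```

The returned URL and headers can be handed to any HTTP client that supports PUT, or translated into `--header` arguments for curl.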

Formats: anything, but:

  • Sites should be uploaded in WARC format;
  • Audio, video, books and other prints are supported in a number of formats;
  • For .tar and .zip files archive.org offers an online browser to search and download the specific files one needs, so you probably want to use one of the two unless you have good reason not to (e.g. if 7z or bzip2 reduces the size tenfold).

Downloading from archive.org

Backing up the Internet Archive

A discussion has begun about creating a distributed backup of the content of the Internet Archive. It is currently in the planning phase. For the initial manifesto, see the INTERNETARCHIVE.BAK page; for the records of the brainstorming, see its talk page; and to follow the discussion in real time, join the #internetarchive.bak IRC channel on EFNet.

Let us clarify once again: ArchiveTeam is not the Internet Archive. This "backing up the Internet Archive" project, like all the other website-rescuing ArchiveTeam projects, is not ordered, asked for, organized or supported by the Internet Archive, nor are ArchiveTeam members employees of the Internet Archive (with a few exceptions). Beyond accepting the content (which, in this case, it also provides), the Internet Archive doesn't collaborate with ArchiveTeam.

See also

External links