INTERNETARCHIVE.BAK/git-annex implementation

From Archiveteam

This page addresses a git-annex implementation of INTERNETARCHIVE.BAK.

For more information, see

First tasks

Some first steps to work on:

  • Get a list of files, checksums, and urls. (done)
  • Write a script to generate a git-annex repository with 100k files from the list. (done)
  • Set up a server to serve up the git repos. Any Linux system with a few hundred GB of disk, plus ssh and git-annex installed, will do. It needs to accept incoming ssh connections from registered clients, only letting them run git-annex-shell. (done)
  • Put one shard repo on the server to start. (done)
  • Manually register a few clients to start, have them manually download some files, and `git annex sync` their state back to the server. See how it all hangs together. (in progress)
  • Get that first shard backed up enough to be able to say, "we have successfully backed up 1/1770th of the IA!"
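
The server requirement above (registered clients may only run git-annex-shell) is commonly met with a forced command in the server's authorized_keys file. The following is a minimal sketch, not the project's actual setup: the helper name restricted_key_line, the repository path, and the key shown in the usage note are hypothetical, and it assumes the repository path contains no spaces. The GIT_ANNEX_SHELL_DIRECTORY and GIT_ANNEX_SHELL_LIMITED variables are described in the git-annex-shell man page.

```shell
#!/bin/sh
# Sketch: print one authorized_keys entry that forces git-annex-shell
# for a single client key. Hypothetical helper, not part of git-annex.
#   $1: repository directory the client is confined to (assumed: no spaces)
#   $2: the client's public key line, e.g. "ssh-ed25519 AAAA... user@host"
restricted_key_line() {
    repo_dir=$1
    pubkey=$2
    # Single quotes keep $SSH_ORIGINAL_COMMAND literal; sshd expands it
    # when the client connects. \\" in the format prints as \".
    printf 'command="GIT_ANNEX_SHELL_DIRECTORY=%s GIT_ANNEX_SHELL_LIMITED=true git-annex-shell -c \\"$SSH_ORIGINAL_COMMAND\\"",no-agent-forwarding,no-port-forwarding,no-X11-forwarding,no-pty %s\n' \
        "$repo_dir" "$pubkey"
}
```

For example, `restricted_key_line /srv/shard1 "ssh-ed25519 AAAA... client@example" >> ~/.ssh/authorized_keys` would register one client on the server account; both the path and the key here are placeholders.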

Middle tasks

  • Test a restore of that first shard. Tell git-annex the content is no longer in the IA. Get the clients to upload it to our server.

Later tasks

  • Create all 1770 shards, and see how that scales.
  • Write a pre-receive git hook to reject pushes of branches other than the git-annex branch (already done), and to prevent bad/malicious pushes of the git-annex branch.
  • Write a client registration interface, which generates the client's ssh private key and git-annex UUID, and sends them to the client.
  • Client runtime environment (docker image maybe?) with a warrior-like interface. All it needs to do is configure things and get git-annex running.
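
The branch check described in the pre-receive item could look roughly like the sketch below. It is an illustration, not the hook actually deployed: the function name check_refs is hypothetical, and it only covers the "reject other branches" half, not validation of the git-annex branch contents. Git passes each pushed ref to a pre-receive hook on stdin as a line of the form "old-sha new-sha refname".

```shell
#!/bin/sh
# Sketch of the ref filter for a pre-receive hook (hypothetical helper).
# Reads "<old-sha> <new-sha> <refname>" lines from stdin and fails as
# soon as a ref other than the git-annex branch is seen.
check_refs() {
    while read -r old new ref; do
        case "$ref" in
            refs/heads/git-annex)
                ;;  # allowed: the git-annex branch
            *)
                echo "rejected: $ref (only the git-annex branch may be pushed)" >&2
                return 1
                ;;
        esac
    done
    return 0
}

# In hooks/pre-receive, the script would end with:
#   check_refs || exit 1
```

A non-zero exit from pre-receive makes git refuse the whole push, so a single disallowed ref is enough to reject it.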


This is our first part of the IA that we want to get backed up. If we succeed, we will have backed up 1/1770th of the Internet Archive. This git-annex repository contains 100k files, the entire collections "internetarchivebooks" and "usenethistorical".

git clone SHARD1@

You need to get your .ssh/ added to be able to access this. Ask closure on IRC (EFNET #internetarchive.bak) for now.

To play with this, just git clone it, and then start git-annex downloading some of the files to back up:

git annex get --not --copies 2

That will back up any files that don't have 2 known copies already, counting the IA as a copy. If it doesn't find enough files, change to --copies 3, and so on.

After you have downloaded some files, let the central repo know you're backing them up by running:

git annex sync
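
The download-then-sync routine above can be wrapped in a small script so it is easy to repeat with a different copy count. This is a sketch under assumptions: the helper names run and backup_pass and the DRY_RUN variable are made up for illustration, and the script must be run from inside your clone of the shard.

```shell
#!/bin/sh
# Sketch: one backup pass, run from inside a clone of the shard.
# With DRY_RUN set to a non-empty value, commands are printed, not run.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: copy threshold; files with fewer known copies than this are fetched
#     (default 2, as in the text above).
backup_pass() {
    copies=${1:-2}
    # Fetch files that do not yet have enough known copies,
    # then tell the central repo which files are now held here.
    run git annex get --not --copies "$copies" &&
    run git annex sync
}
```

Calling `backup_pass 3` corresponds to the "change to --copies 3" advice; setting DRY_RUN=1 first shows the two commands without touching the repository.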

Some stats about the files this repository is tracking:

  • number of files: 103343
  • total file size: 2.91 terabytes
  • size of the git repository itself is 51 megabytes

Note that because the IA census uses md5sums, you need git-annex version 5.20150205 or newer to run git-annex fsck in this repository.

Older versions of git-annex will work for everything else, but not fsck.

One way to install that version is

Tuning your repo

So you want to back up part of the IA, but don't want this to take over your whole disk or internet pipe? Here are some tuning options you can use. Run these commands in the git repo you checked out.

git config annex.diskreserve 200GB

This will prevent git-annex from using up the last 200 GB of your disk. Adjust to suit.

git config annex.web-options --limit-rate=200k

This will limit wget/curl to downloading at 200 KB/s (kilobytes per second). Adjust to suit.
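
Both settings can be applied in one step and then read back to confirm they took effect. A minimal sketch, assuming it is run inside the checked-out repo; the function name tune_repo is hypothetical. Note that git config takes the key and the value as separate arguments.

```shell
#!/bin/sh
# Sketch: apply both tuning options to the current repository.
#   $1: disk space to keep free (default 200GB)
#   $2: download rate limit passed to wget/curl (default 200k)
tune_repo() {
    git config annex.diskreserve "${1:-200GB}" &&
    git config annex.web-options "--limit-rate=${2:-200k}"
}
```

After running `tune_repo 50GB 100k`, `git config --get-regexp '^annex\.'` lists the annex settings stored for the repo, which is a quick way to check the values.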