Revision as of 18:59, 6 April 2015

This page addresses a git-annex implementation of INTERNETARCHIVE.BAK.

For more information, see http://git-annex.branchable.com/design/iabackup/

First tasks

Some first steps to work on:

Get a list of files, checksums, and urls. (done)
Write a script to generate a git-annex repository with 100k files from the list. (done)
Set up a server to serve up the git repos. Any Linux system with a few hundred GB of disk, ssh, and git-annex installed will do. It needs to accept incoming ssh connections from registered clients, only letting them run git-annex-shell. (done)
Put one shard repo on the server to start. (done)
Manually register a few clients to start, have them manually download some files, and `git annex sync` their state back to the server. See how it all hangs together. (done)
Get that first shard backed up enough to be able to say, "we have successfully backed up 1/1770th of the IA!" (done!)
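
As a rough illustration of the list-to-shard step above, here is a minimal Python sketch. It is not the project's actual script: it assumes a hypothetical tab-separated input with one checksum, size, and URL per line, and it simply cuts the list into shards of at most 100k files in list order. The names `FileEntry`, `parse_list`, and `make_shards` are made up for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class FileEntry:
    """One line of the (assumed) file list: checksum, size in bytes, URL."""
    checksum: str
    size: int
    url: str


def parse_list(lines: Iterable[str]) -> Iterator[FileEntry]:
    """Parse tab-separated 'checksum<TAB>size<TAB>url' lines, skipping blanks."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        checksum, size, url = line.split("\t")
        yield FileEntry(checksum, int(size), url)


def make_shards(entries: Iterable[FileEntry],
                shard_size: int = 100_000) -> Iterator[List[FileEntry]]:
    """Group entries into consecutive shards of at most shard_size files."""
    shard: List[FileEntry] = []
    for entry in entries:
        shard.append(entry)
        if len(shard) == shard_size:
            yield shard
            shard = []
    if shard:
        # Emit the final, possibly smaller, shard.
        yield shard
```

Each shard produced this way would then be turned into its own git-annex repository; that step depends on the real list format and is not shown here.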

Middle tasks

Get fscking and dead client expiry working for the first shard.
Test a restore of that first shard. Tell git-annex the content is no longer in the IA. Get the clients to upload it to our server.
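
One way to think about dead client expiry is sketched below in Python. This is only an illustration under stated assumptions: the 30-day threshold is a guess, the `last_seen` mapping of client UUID to last check-in time is hypothetical, and how those timestamps are collected from the repository is not covered.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def expired_clients(last_seen: Dict[str, datetime],
                    now: datetime,
                    max_age: timedelta = timedelta(days=30)) -> List[str]:
    """Return the UUIDs of clients not seen within max_age, sorted.

    The 30-day default is an assumption made for this example.
    """
    cutoff = now - max_age
    return sorted(uuid for uuid, seen in last_seen.items() if seen < cutoff)
```

Copies held by the clients returned here would no longer count toward a file's backup total, so those files would become candidates for re-download by other clients.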

Later tasks

Create all 1770 shards, and see how that scales.
Write a pre-receive git hook to reject pushes of branches other than the git-annex branch (already done), and to prevent bad/malicious pushes of the git-annex branch.
Write a client registration interface, which generates the client's ssh private key and git-annex UUID, and sends them to the client.
Client runtime environment (docker image maybe?) with a warrior-like interface (all it needs to do is configure things and get git-annex running).
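
The branch-filtering half of the pre-receive hook could look roughly like the Python sketch below. It relies on the standard git behaviour that a pre-receive hook receives one `<old-sha> <new-sha> <refname>` line per pushed ref on stdin and rejects the whole push by exiting non-zero. The `ALLOWED_REFS` set and the function names are assumptions for this example, and the sketch does not attempt the harder task of validating the contents of a git-annex branch push.

```python
import sys
from typing import Iterable, List

# Assumption: only the git-annex branch may be pushed. A real deployment
# may need to allow other refs as well (for example synced/ branches).
ALLOWED_REFS = {"refs/heads/git-annex"}


def rejected_refs(lines: Iterable[str]) -> List[str]:
    """Return the ref names from pre-receive input that are not allowed."""
    rejected = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            # Ignore blank or malformed lines.
            continue
        _old_sha, _new_sha, ref = parts
        if ref not in ALLOWED_REFS:
            rejected.append(ref)
    return rejected


def main() -> int:
    """Hook entry point: returns 1 (reject the push) if any ref is disallowed."""
    bad = rejected_refs(sys.stdin)
    for ref in bad:
        print(f"rejected push to {ref}", file=sys.stderr)
    return 1 if bad else 0
```

To use this as an actual hook, the file would be installed as `hooks/pre-receive` in the server-side repository, made executable, and end with `sys.exit(main())`.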

SHARD1

This is our first part of the IA that we want to get backed up. If we succeed, we will have backed up 1/1770th of the Internet Archive. This git-annex repository contains 100k files, the entire collections "internetarchivebooks" and "usenethistorical".

To help back up SHARD1, check out this git repository: <https://github.com/ArchiveTeam/IA.BAK>

The iabak script will set you up and get you downloading files from the IA into your backup drive.

Some stats about the files this repository is tracking:

number of files: 103343
total file size: 2.91 terabytes
size of the git repository itself was 51 megabytes to start
after filling up SHARD1, the git repo had grown to 196 MB
We aimed for 4 copies of every file downloaded, but a few files got 5-8 copies made, due to, e.g., races and manual downloads. We want to keep an eye on this with future shards.
We got SHARD1 fully downloaded between April 1 and April 6. It took a while to ramp up as people came in, so later shards may download faster. Also, 2/3 of SHARD2 was downloaded during this same time period.
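
To keep an eye on over-replication in future shards, a tally along the following lines could help. This is a minimal Python sketch, not part of the iabak tooling: it assumes a hypothetical `locations` mapping from file key to the set of client UUIDs holding a copy (such data could be derived from git-annex's location tracking, for example the output of `git annex whereis`, but the extraction is not shown), and it uses the target of 4 copies mentioned above.

```python
from typing import Dict, List, Set, Tuple


def replication_report(locations: Dict[str, Set[str]],
                       target: int = 4) -> Tuple[List[str], List[str]]:
    """Split files into those with fewer and those with more copies than target.

    Returns (under_replicated, over_replicated), each sorted by file key.
    Files with exactly `target` copies appear in neither list.
    """
    under = sorted(key for key, clients in locations.items() if len(clients) < target)
    over = sorted(key for key, clients in locations.items() if len(clients) > target)
    return under, over
```

The over-replicated list corresponds to the files that ended up with 5-8 copies, while the under-replicated list shows which files still need more downloads.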

Status

You can find an initial graph of the status here, and exact numbers here.

Difference between revisions of "INTERNETARCHIVE.BAK/git-annex implementation"
