This page addresses a [https://git-annex.branchable.com git-annex] implementation of [[INTERNETARCHIVE.BAK]].


= Quickstart =

Do this on the drive you want to use:

<pre>
$ git clone https://github.com/ArchiveTeam/IA.BAK
$ cd IA.BAK
$ ./iabak
</pre>
 
It will walk you through setup, start downloading files, and install a cron job (or systemd .timer unit) to perform periodic maintenance.
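For reference, here is a hedged sketch of the kind of crontab entry the setup installs; the exact schedule, path, and script name are chosen by <code>./iabak</code>, so check <code>crontab -l</code> after setup rather than copying this verbatim:

<pre>
# illustrative only -- the real entry is written by ./iabak during setup
30 3 * * * cd $HOME/IA.BAK && ./iabak-cronjob.sh
</pre>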
 
It will prompt you for how much disk space to leave free. To adjust this value later, run <code>git config annex.diskreserve 200GB</code> in each of the <code>IA.BAK/shard*</code> directories.
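For example, assuming your checkout is at <code>~/IA.BAK</code> with the default <code>shard*</code> layout, a small loop applies the new reserve to every shard at once:

<pre>
$ cd ~/IA.BAK
$ for d in shard*; do (cd "$d" && git config annex.diskreserve 200GB); done
</pre>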
 
Configuration and maintenance information can be found in the README.md file. (Also available at https://github.com/ArchiveTeam/IA.BAK/#readme)
 
=== Dependencies ===
* sane UNIX environment (shell, df, perl, grep)
* git
* crontab OR systemd (NOTE: you may need to run <code>loginctl enable-linger <user></code> so the maintenance job is not killed when you log out; see the sketch after this list)
* <code>shuf</code> (optional - will randomize the order you download files in)
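For the systemd case, a minimal sketch of enabling lingering and confirming that the user timer is scheduled (the timer unit name is set during setup, so the output will vary):

<pre>
$ loginctl enable-linger $USER
$ systemctl --user list-timers
</pre>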


= Status =

* Graphs of status
* raw data

= Implementation plan =

For more information, see http://git-annex.branchable.com/design/iabackup/

== First tasks ==

Some first steps to work on:

* Get a list of files, checksums, and urls. (done)
* Write a script to generate a git-annex repository with 100k files from the list. (done)
* Set up a server to serve the git repos. Any Linux system with a few hundred GB of disk, ssh, and git-annex installed will do. It needs to accept incoming ssh connections from registered clients, only letting them run git-annex-shell. (done)
* Put one shard repo on the server to start. (done)
* Manually register a few clients to start, have them manually download some files, and <code>git annex sync</code> their state back to the server to see how it all hangs together (a rough sketch of this flow follows this list). (done)
* Get that first shard backed up enough to be able to say, "we have successfully backed up 1/1770th of the IA!" (done!)
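The manual client flow mentioned above looks roughly like this with stock git-annex commands; the server address and shard path are placeholders, not the project's real hostname:

<pre>
$ git clone ssh://iabak@shard-server.example.org/~/shard1 shard1
$ cd shard1
$ git annex get --auto     # download files until the desired number of copies is reached
$ git annex sync           # push location-tracking info back to the server
</pre>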

== Middle tasks ==

* Get fscking and dead client expiry working (done)
* Test a restore from a shard: tell git-annex the content is no longer in the IA, then get the clients to upload it to our server (a sketch of one possible approach follows this list).
* Write a client registration interface, which generates the client's ssh private key and git-annex UUID and sends them to the client (done)
* Help the user get the iabak-cronjob set up.
* Email expire warnings (done)
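One way such a restore test could be exercised with standard git-annex commands (a sketch under assumptions, not the project's actual procedure; <code>origin</code> stands in for the shard server):

<pre>
$ git annex fsck --from web     # re-check the IA copies; missing content updates location tracking
$ git annex copy --to origin    # upload locally held content back to the shard server
</pre>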

== Later tasks ==

* Create all 1770 shards, and see how that scales.
* Write a pre-receive git hook to reject pushes of branches other than the git-annex branch (already done), and to prevent bad/malicious pushes of the git-annex branch (a sketch of the branch check follows this list).
* Client runtime environment (docker image maybe?) with a warrior-like interface (all that needs to do is configure things and get git-annex running)
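A minimal sketch of the branch check such a pre-receive hook could perform (it only filters ref names and does not try to validate the contents of the git-annex branch itself):

<pre>
#!/bin/sh
# Reject any pushed ref other than the git-annex branch.
while read oldrev newrev refname; do
    if [ "$refname" != "refs/heads/git-annex" ]; then
        echo "push rejected: only the git-annex branch may be pushed" >&2
        exit 1
    fi
done
exit 0
</pre>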

= SHARD1 =

This is the first part of the IA that we want to get backed up. If we succeed, we will have backed up 1/1770th of the Internet Archive. This git-annex repository contains 100k files: the entire "internetarchivebooks" and "usenethistorical" collections.

Some stats about the files this repository is tracking:

* number of files: 103343
* total file size: 2.91 terabytes
* size of the git repository itself was 51 megabytes to start
* after filling up shard1, the git repo had grown to 196 MB
* We aimed for 4 copies of every file downloaded, but a few files got 5-8 copies made, due to e.g. races and manual downloads. We want to keep an eye on this with future shards (see the note on numcopies after this list).
* We got SHARD1 fully downloaded between April 1-6th. It took a while to ramp up as people came in, so later shards may download faster. Also, 2/3 of SHARD2 was downloaded during this same time period.
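In git-annex terms the copy target is expressed as numcopies; for example (a hedged illustration, the project may configure this differently):

<pre>
$ git annex numcopies 4     # ask git-annex to retain at least 4 copies of each file
</pre>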

= Admin details =

See [[INTERNETARCHIVE.BAK/admin]].