Difference between revisions of "INTERNETARCHIVE.BAK/git-annex implementation"

(Add quickstart section)
Revision as of 01:27, 23 April 2015

This page addresses a git-annex implementation of INTERNETARCHIVE.BAK.

For more information, see http://git-annex.branchable.com/design/iabackup/

Quickstart

Do this on the drive you want to use:

$ git clone https://github.com/ArchiveTeam/IA.BAK

$ cd IA.BAK

$ ./iabak

Configuration and maintenance information can be found at https://github.com/ArchiveTeam/IA.BAK/

First tasks

Some first steps to work on:

  • Get a list of files, checksums, and URLs. (done)
  • Write a script to generate a git-annex repository with 100k files from the list. (done)
  • Set up a server to serve up the git repos. Any Linux system with a few hundred GB of disk and with ssh and git-annex installed will do. It needs to accept incoming ssh connections from registered clients, only letting them run git-annex-shell. (done)
  • Put one shard repo on the server to start. (done)
  • Manually register a few clients to start, have them manually download some files, and `git annex sync` their state back to the server. See how it all hangs together. (done)
  • Get that first shard backed up enough to be able to say, "we have successfully backed up 1/1770th of the IA!" (done!)

Middle tasks

  • Get fscking and dead client expiry working for the 1st shard (done, but expiry is not running yet)
  • Test a restore of that first shard. Tell git-annex the content is no longer in the IA. Get the clients to upload it to our server.
  • Write client registration interface, which generates the client's ssh private key, git-annex UUID, and sends them to the client (done)
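The dead client expiry item above can be illustrated with a small sketch. This is not the project's actual code: the `expired_clients` helper, the 30-day cutoff, and the shortened UUIDs are all assumptions made up for the example. It only shows the rule "a client that has not checked in for too long no longer counts as holding copies".

```python
from datetime import datetime, timedelta, timezone


def expired_clients(last_seen, now, max_age=timedelta(days=30)):
    """Return the sorted UUIDs of clients not heard from within max_age.

    last_seen maps a client's git-annex UUID to the time of its last sync.
    The 30-day default is an assumption for illustration only.
    """
    return sorted(uuid for uuid, seen in last_seen.items() if now - seen > max_age)


# Hypothetical data: two clients, one of which has gone quiet.
now = datetime(2015, 4, 23, tzinfo=timezone.utc)
last_seen = {
    "uuid-active": now - timedelta(days=2),
    "uuid-silent": now - timedelta(days=45),
}
print(expired_clients(last_seen, now))
```

git-annex has its own `expire` command for this purpose, so a real deployment would most likely call that rather than reimplement the rule.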

Later tasks

  • Create all 1770 shards, and see how that scales.
  • Write pre-receive git hook, to reject pushes of branches other than the git-annex branch (already done), and prevent bad/malicious pushes of the git-annex branch
  • Client runtime environment (Docker image maybe?) with warrior-like interface (all it needs to do is configure things and get git-annex running)
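As a rough sketch of the pre-receive hook item above: git feeds a pre-receive hook one line per pushed ref on standard input, in the form "old-id new-id refname", and a non-zero exit status rejects the whole push. The code below is not the hook actually deployed on the server. The function names are made up for illustration, and it covers only the first half of the task (rejecting other branches, plus refusing deletion of the git-annex branch); it does not inspect the content of git-annex branch pushes.

```python
import sys

ALLOWED_REF = "refs/heads/git-annex"


def check_push(lines):
    """Return a list of rejection messages for the pushed ref updates.

    Each input line has the form "<old-id> <new-id> <refname>", which is
    the format git writes to a pre-receive hook's standard input.
    """
    errors = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            errors.append(f"malformed update line: {line!r}")
            continue
        _old, new, ref = parts
        if ref != ALLOWED_REF:
            errors.append(f"rejected {ref}: only {ALLOWED_REF} may be pushed")
        elif new.strip("0") == "":
            # An all-zero new id means the push would delete the ref.
            errors.append(f"rejected {ref}: deleting the branch is not allowed")
    return errors


def main(stdin=sys.stdin):
    """Print any rejections to stderr and return the hook's exit status."""
    errors = check_push(line for line in stdin if line.strip())
    for message in errors:
        print(message, file=sys.stderr)
    return 1 if errors else 0
```

To use something like this as a real hook, it would be saved as an executable `hooks/pre-receive` file in the server-side repository, with a `#!/usr/bin/env python3` first line and a final `sys.exit(main())` call.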

SHARD1

This is our first part of the IA that we want to get backed up. If we succeed, we will have backed up 1/1770th of the Internet Archive. This git-annex repository contains 100k files, the entire collections "internetarchivebooks" and "usenethistorical".

Some stats about the files this repository is tracking:

  • number of files: 103343
  • total file size: 2.91 terabytes
  • size of the git repository itself was 51 megabytes to start
  • after filling up shard1, the git repo had grown to 196 MB
  • We aimed for 4 copies of every file downloaded, but a few files got 5-8 copies made, due to, e.g., races and manual downloads. We want to keep an eye on this with future shards.
  • We got SHARD1 fully downloaded between April 1 and April 6. It took a while to ramp up as people came in, so later shards may download faster. Also, 2/3 of SHARD2 was downloaded during this same time period.
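A few derived figures follow from the stats above. The sketch below only does arithmetic on the numbers listed; it assumes decimal units (1 terabyte = 10^12 bytes, 1 megabyte = 10^6 bytes), which the page does not state, and it ignores the extra 5-8 copies that a few files received.

```python
FILES = 103_343            # number of files in shard1
TOTAL_BYTES = 2.91e12      # 2.91 terabytes, assuming decimal units
TARGET_COPIES = 4          # copies aimed for per file
REPO_START_BYTES = 51e6    # git repo size before any downloads
REPO_FULL_BYTES = 196e6    # git repo size after shard1 was filled

# Average size of one tracked file, in megabytes.
avg_file_mb = TOTAL_BYTES / FILES / 1e6

# Disk space needed across all clients to hold the target number of copies.
stored_tb = TOTAL_BYTES * TARGET_COPIES / 1e12

# How much location-tracking data the git repo gained per tracked file.
repo_growth_per_file = (REPO_FULL_BYTES - REPO_START_BYTES) / FILES

print(f"average file size: {avg_file_mb:.1f} MB")
print(f"storage for {TARGET_COPIES} copies: {stored_tb:.2f} TB")
print(f"git repo growth per file: {repo_growth_per_file:.0f} bytes")
```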

Status

You can find an initial graph of the status here, and exact numbers here.