Difference between revisions of "INTERNETARCHIVE.BAK/git-annex implementation"

From Archiveteam
m (Reverted edits by Megalanya1 (talk) to last revision by Jscott)
 
(25 intermediate revisions by 7 users not shown)
Line 1: Line 1:
 
This page addresses a [https://git-annex.branchable.com git-annex] implementation of [[INTERNETARCHIVE.BAK]].
 +
 +
= Quickstart =
 +
 +
Do this on the drive you want to use:
 +
 +
<pre>
 +
$ git clone https://github.com/ArchiveTeam/IA.BAK
 +
$ cd IA.BAK
 +
$ ./iabak
 +
</pre>
 +
 +
The script walks you through setup, starts downloading files, and installs a cron job (or .timer unit) to perform periodic maintenance.
 +
 +
It should prompt you for how much disk space to leave free. To adjust this value later, run <code>git config annex.diskreserve 200GB</code> in each of the <code>IA.BAK/shard*</code> directories.
 +
 +
Configuration and maintenance information can be found in the README.md file. (Also available at https://github.com/ArchiveTeam/IA.BAK/#readme)
 +
 +
=== Dependencies ===
 +
* sane UNIX environment (shell, df, perl, grep)
 +
* git
 +
* crontab OR systemd (NOTE: you may need to run <code>loginctl enable-linger <user></code> to make sure the job is not killed)
 +
* <code>shuf</code> (optional - will randomize the order you download files in)
 +
 +
= Status =
 +
 +
[http://iabak.archiveteam.org/ Graphs of status]
 +
 +
[http://iabak.archiveteam.org/stats/ raw data]
 +
 +
= Implementation plan =
  
 
For more information, see http://git-annex.branchable.com/design/iabackup/
  
= First tasks =
+
== First tasks ==
  
 
Some first steps to work on:
Line 11: Line 41:
 
* Set up a server to serve up the git repos. Any linux system with a few hundred gb of disk and ssh and git-annex installed will do. It needs to accept incoming ssh connections from registered clients, only letting them run git-annex-shell. (done)
 
* Put one shard repo on the server to start. (done)
* Manually register a few clients to start, have them manually download some files, and `git annex sync` their state back to the server. See how it all hangs together. (in progress)
+
* Manually register a few clients to start, have them manually download some files, and `git annex sync` their state back to the server. See how it all hangs together. (done)
* Get that first shard backed up enough to be able to say, "we have successfully backed up 1/1770th of the IA!"
+
* Get that first shard backed up enough to be able to say, "we have successfully backed up 1/1770th of the IA!" (done!)
  
= Middle tasks =
+
== Middle tasks ==
  
* Test a restore of that first shard. Tell git-annex the content is no longer in the IA. Get the clients to upload it to our server.
+
* get fscking and dead client expiry working (done)
 +
* Test a restore from a shard. Tell git-annex the content is no longer in the IA. Get the clients to upload it to our server.
 +
* Write client registration interface, which generates the client's ssh private key, git-annex UUID, and sends them to the client (done)
 +
* Help the user get the iabak-cronjob set up.
 +
* Email expire warnings (done)
  
= Later tasks =
+
== Later tasks ==
  
 
* Create all 1770 shards, and see how that scales.
 
* Write pre-receive git hook, to reject pushes of branches other than the git-annex branch (already done), and prevent bad/malicious pushes of the git-annex branch
* Write client registration interface, which generates the client's ssh private key, git-annex UUID, and sends them to the client
 
 
* Client runtime environment (docker image maybe?) with warrior-like interface (all that needs to do is configure things and get git-annex running)
  
= shard1 =
+
== SHARD1 ==
  
 
This is our first part of the IA that we want to get backed up. If we succeed, we will have backed up 1/1770th of the Internet Archive.
 
This git-annex repository contains 100k files, the entire collections "internetarchivebooks" and "usenethistorical".
 
To help back up shard1, check out this git repository: <https://github.com/ArchiveTeam/IA.BAK>
 
 
The iabak script will set you up and get you downloading files from the IA into your backup drive.
 
  
 
Some stats about the files this repository is tracking:
 
* number of files: 103343
 
* total file size: 2.91 terabytes
* size of the git repository itself is 51 megabytes
+
* size of the git repository itself was 51 megabytes to start
 
+
* after filling up shard1, the git repo had grown to 196 mb
== tuning your repo ==
+
* We aimed for 4 copies of every file, but a few files ended up with 5-8 copies due to, e.g., races and manual downloads. This is worth keeping an eye on with future shards.
 
+
* SHARD1 was fully downloaded between April 1 and April 6. It took a while to ramp up as people came in, so later shards may download faster. Also, 2/3 of SHARD2 was downloaded during the same period.
So you want to back up part of the IA, but don't want this to take over your whole disk or internet pipe? Here are some tuning options you can use. Run these commands in the git repo you checked out.
 
 
 
'''git config annex.diskreserve 200GB'''
 
 
 
This will prevent git-annex from using up the last 200 GB of your disk. Adjust to suit.
 
 
 
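'''git config annex.web-options "--limit-rate=200k"'''
 
 
 
This will limit wget/curl to downloading at about 200 kilobytes per second. Note that <code>git config</code> takes the key and the value as separate arguments, not joined with <code>=</code>. Adjust to suit.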
 
 
 
== Ubuntu setup ==
 
 
 
First, create an ssh key: run ssh-keygen and send the resulting id_rsa.pub to Closure (on IRC!)
 
 
 
Next, add the PPA for git-annex, because the default git-annex package is too old: https://launchpad.net/~jtgeibel/+archive/ubuntu/ppa?field.series_filter=trusty
 
  
Run git-annex version and make sure it reports git-annex version: 5.20150327~ubuntu14.04.1~ppa3 or later.
+
== Admin details ==
  
Now run git clone SHARD1@124.6.40.227:shard1 in the location where you want to save this data, then run git annex get inside a screen or tmux session.
+
See [[INTERNETARCHIVE.BAK/admin]].

Latest revision as of 16:32, 17 January 2017

This page addresses a git-annex implementation of INTERNETARCHIVE.BAK.

Quickstart

Do this on the drive you want to use:

$ git clone https://github.com/ArchiveTeam/IA.BAK
$ cd IA.BAK
$ ./iabak

The script walks you through setup, starts downloading files, and installs a cron job (or .timer unit) to perform periodic maintenance.

It should prompt you for how much disk space to leave free. To adjust this value later, run git config annex.diskreserve 200GB in each of the IA.BAK/shard* directories.
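
Changing the reserve in every shard directory by hand gets tedious once you have several shards. A minimal sketch of a loop that does it, assuming the shards live directly under the directory you pass in; the function name set_diskreserve is made up for this example and is not part of the iabak scripts:

```shell
#!/bin/sh
# Sketch only: set annex.diskreserve in every shard* checkout under a base directory.
# set_diskreserve is a hypothetical helper, not something IA.BAK ships.
set_diskreserve() {
    base=$1      # directory containing the shard* checkouts, e.g. your IA.BAK clone
    reserve=$2   # value understood by git-annex, e.g. 200GB
    for dir in "$base"/shard*; do
        # Skip anything that is not a git checkout (also covers the no-match case).
        [ -e "$dir/.git" ] || continue
        git -C "$dir" config annex.diskreserve "$reserve"
    done
}

# Example call (the path is an assumption about where you cloned IA.BAK):
#   set_diskreserve "$HOME/IA.BAK" 200GB
```

The loop only writes git configuration, so it is safe to re-run whenever you want a different reserve.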

Configuration and maintenance information can be found in the README.md file. (Also available at https://github.com/ArchiveTeam/IA.BAK/#readme)

Dependencies

  • sane UNIX environment (shell, df, perl, grep)
  • git
  • crontab OR systemd (NOTE: you may need to run loginctl enable-linger <user> to make sure the job is not killed)
  • shuf (optional - will randomize the order you download files in)
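
If you want to check the list above before running iabak, something like the following would do. This is a rough sketch, not what the iabak script itself does; missing_commands is a hypothetical helper, and looking for systemctl is just one assumed way to detect systemd:

```shell
#!/bin/sh
# Sketch only: print each named command that cannot be found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Required tools from the dependency list.
missing=$(missing_commands df perl grep git)
[ -z "$missing" ] || echo "missing required tools: $missing"

# Either crontab or systemd is enough (systemctl is used as a stand-in for "has systemd").
if [ -n "$(missing_commands crontab)" ] && [ -n "$(missing_commands systemctl)" ]; then
    echo "need either crontab or systemd for periodic maintenance"
fi

# shuf is optional; without it the download order is not randomized.
[ -z "$(missing_commands shuf)" ] || echo "optional: shuf not found"
```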

Status

Graphs of status: http://iabak.archiveteam.org/

Raw data: http://iabak.archiveteam.org/stats/

Implementation plan

For more information, see http://git-annex.branchable.com/design/iabackup/

First tasks

Some first steps to work on:

  • Get a list of files, checksums, and urls. (done)
  • Write a script to generate a git-annex repository with 100k files from the list. (done)
  • Set up a server to serve up the git repos. Any linux system with a few hundred gb of disk and ssh and git-annex installed will do. It needs to accept incoming ssh connections from registered clients, only letting them run git-annex-shell. (done)
  • Put one shard repo on the server to start. (done)
  • Manually register a few clients to start, have them manually download some files, and `git annex sync` their state back to the server. See how it all hangs together. (done)
  • Get that first shard backed up enough to be able to say, "we have successfully backed up 1/1770th of the IA!" (done!)
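
For the server step, "only letting them run git-annex-shell" is usually done with a forced command in the shard user's authorized_keys file. The sketch below prints such an entry, modelled on the example in the git-annex-shell man page. The function name, the repository path, and the key in the example call are placeholders, and this is not necessarily how the real IA.BAK server is configured:

```shell
#!/bin/sh
# Sketch only: print an authorized_keys entry that confines one client key
# to git-annex-shell operating on a single repository directory.
# authorized_keys_entry is a hypothetical helper name.
authorized_keys_entry() {
    repo_dir=$1   # repository the client may touch (placeholder path)
    pubkey=$2     # the client's public key line
    printf 'command="GIT_ANNEX_SHELL_DIRECTORY=%s git-annex-shell -c \\"$SSH_ORIGINAL_COMMAND\\"",no-agent-forwarding,no-port-forwarding,no-X11-forwarding %s\n' \
        "$repo_dir" "$pubkey"
}

# Example call (both arguments are made up):
#   authorized_keys_entry /home/SHARD1/shard1 "ssh-ed25519 AAAA... client1" >> ~/.ssh/authorized_keys
```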

Middle tasks

  • get fscking and dead client expiry working (done)
  • Test a restore from a shard. Tell git-annex the content is no longer in the IA. Get the clients to upload it to our server.
  • Write client registration interface, which generates the client's ssh private key, git-annex UUID, and sends them to the client (done)
  • Help the user get the iabak-cronjob set up.
  • Email expire warnings (done)
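
The fsck and expiry items fit together roughly like this: clients fsck their copy and sync the result back, and repositories that have not reported an fsck for too long get marked dead. A sketch using the stock git annex fsck, git annex sync and git annex expire commands; the function names and the 60d cutoff are assumptions for illustration, and the real iabak scripts may do this differently:

```shell
#!/bin/sh
# Sketch only; both function names are hypothetical.

# On a client: verify the locally stored files, then report state to the server.
client_maintenance() {
    shard=$1
    git -C "$shard" annex fsck --quiet && git -C "$shard" annex sync
}

# On the admin side: mark repositories with no recent fsck activity as dead.
expire_stale_clients() {
    shard=$1
    max_age=$2   # e.g. 60d; the actual cutoff used by the project is not stated here
    git -C "$shard" annex expire "$max_age"
}

# Example calls (paths and cutoff are placeholders):
#   client_maintenance "$HOME/IA.BAK/shard1"
#   expire_stale_clients /srv/shard1 60d
```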

Later tasks

  • Create all 1770 shards, and see how that scales.
  • Write pre-receive git hook, to reject pushes of branches other than the git-annex branch (already done), and prevent bad/malicious pushes of the git-annex branch
  • Client runtime environment (docker image maybe?) with warrior-like interface (all that needs to do is configure things and get git-annex running)
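
The branch-rejecting half of the pre-receive hook could look something like the sketch below. Git passes a pre-receive hook one "old new refname" line per pushed ref on stdin. The function name check_pushed_refs is made up, this is not the hook actually deployed, and it does nothing about the harder problem of bad pushes to the git-annex branch itself:

```shell
#!/bin/sh
# Sketch only: accept a push only if every updated ref is the git-annex branch.
# check_pushed_refs is a hypothetical name.
check_pushed_refs() {
    status=0
    while read -r old new ref; do
        case $ref in
            refs/heads/git-annex) ;;   # the only branch clients may update
            *)
                echo "rejected: $ref (only the git-annex branch may be pushed)" >&2
                status=1
                ;;
        esac
    done
    return $status
}

# A real hooks/pre-receive file would end with:
#   check_pushed_refs || exit 1
```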

SHARD1

This is our first part of the IA that we want to get backed up. If we succeed, we will have backed up 1/1770th of the Internet Archive. This git-annex repository contains 100k files, the entire collections "internetarchivebooks" and "usenethistorical".

Some stats about the files this repository is tracking:

  • number of files: 103343
  • total file size: 2.91 terabytes
  • size of the git repository itself was 51 megabytes to start
  • after filling up shard1, the git repo had grown to 196 megabytes
  • We aimed for 4 copies of every file, but a few files ended up with 5-8 copies due to, e.g., races and manual downloads. This is worth keeping an eye on with future shards.
  • SHARD1 was fully downloaded between April 1 and April 6. It took a while to ramp up as people came in, so later shards may download faster. Also, 2/3 of SHARD2 was downloaded during the same period.
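
To put the SHARD1 numbers in perspective, here is the back-of-the-envelope arithmetic as a small script. It assumes decimal units (1 terabyte = 1,000,000 megabytes) and exactly 4 copies per file, which slightly understates the real total given the extra copies mentioned above; shard_stats is a made-up helper name:

```shell
#!/bin/sh
# Sketch only: average file size and total storage needed for N copies of a shard.
shard_stats() {
    awk -v files="$1" -v tb="$2" -v copies="$3" 'BEGIN {
        printf "average file size: %.1f MB\n", tb * 1000000 / files
        printf "storage for %d copies: %.2f TB\n", copies, tb * copies
    }'
}

# SHARD1: 103343 files, 2.91 terabytes, target of 4 copies.
shard_stats 103343 2.91 4
```

So the average file is a bit under 30 MB, and the volunteers collectively hold a bit under 12 TB for this one shard.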

Admin details

See INTERNETARCHIVE.BAK/admin.