Revision as of 20:09, 26 March 2015

This page addresses a git-annex implementation of INTERNETARCHIVE.BAK.

For more information, see http://git-annex.branchable.com/design/iabackup/

First tasks

Some first steps to work on:

Get a list of files, checksums, and urls. (done)
Write a script to generate a git-annex repository with 100k files from the list. (done)
Set up a server to serve up the git repos. Any Linux system with a few hundred GB of disk, and with ssh and git-annex installed, will do. It needs to accept incoming ssh connections from registered clients, only letting them run git-annex-shell. (done)
Put one shard repo on the server to start. (done)
Manually register a few clients to start, have them manually download some files, and `git annex sync` their state back to the server. See how it all hangs together. (in progress)
Get that first shard backed up enough to be able to say, "we have successfully backed up 1/1770th of the IA!"
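
The "only letting them run git-annex-shell" requirement is commonly met with a forced command in the server's authorized_keys file. The following is a minimal sketch of generating such an entry, assuming OpenSSH and the GIT_ANNEX_SHELL_DIRECTORY variable described in the git-annex-shell man page. The function name, the repository path, and the exact set of options are hypothetical and not taken from the project's actual server setup.

```python
def authorized_keys_line(public_key: str, repo_dir: str) -> str:
    """Build one authorized_keys entry that forces git-annex-shell.

    Assumption: repo_dir is the shard repository this client may access.
    The option list is illustrative, not the project's real configuration.
    """
    public_key = public_key.strip()
    if "\n" in public_key or '"' in public_key:
        raise ValueError("public key must be a single line without quotes")
    if any(ch.isspace() for ch in repo_dir) or '"' in repo_dir:
        raise ValueError("repo_dir must not contain whitespace or quotes")

    # git-annex-shell re-parses the command the client asked ssh to run.
    forced = (
        f"GIT_ANNEX_SHELL_DIRECTORY={repo_dir} "
        'git-annex-shell -c \\"$SSH_ORIGINAL_COMMAND\\"'
    )
    options = [
        f'command="{forced}"',
        "no-agent-forwarding",
        "no-port-forwarding",
        "no-X11-forwarding",
        "no-pty",
    ]
    return ",".join(options) + " " + public_key
```

Each registered client would get one such line, so a client's key can never start an ordinary login shell on the server.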

Middle tasks

Test a restore of that first shard. Tell git-annex the content is no longer in the IA. Get the clients to upload it to our server.

Later tasks

Create all 1770 shards, and see how that scales.
Write pre-receive git hook, to reject pushes of branches other than the git-annex branch, and probably do other checks for bad/malicious pushes.
Write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)
Client runtime environment (docker image maybe?) with warrior-like interface (all that needs to do is configure things and get git-annex running)
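
As a rough illustration of the pre-receive hook task, here is a sketch of the core check. Git feeds a pre-receive hook one line per updated ref on stdin, in the form "old-sha new-sha refname", and a non-zero exit rejects the whole push. The allowed-ref set, the function names, and the rule against deleting the branch are assumptions for illustration. The "other checks for bad/malicious pushes" are not attempted here.

```python
import sys

# Assumption: clients may only update the git-annex branch.
ALLOWED_REFS = frozenset({"refs/heads/git-annex"})


def rejected_updates(lines, allowed=ALLOWED_REFS):
    """Return (ref, reason) pairs for every update that should be refused."""
    problems = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if len(parts) != 3:
            problems.append((line.strip(), "malformed update line"))
            continue
        _old, new, ref = parts
        if ref not in allowed:
            problems.append((ref, "only the git-annex branch may be pushed"))
        elif set(new) == {"0"}:
            # An all-zero new value means the ref is being deleted.
            problems.append((ref, "deleting the branch is not allowed"))
    return problems


def main(stream):
    """Print any problems to stderr; return the hook's exit status."""
    problems = rejected_updates(stream)
    for ref, reason in problems:
        print(f"rejected {ref}: {reason}", file=sys.stderr)
    return 1 if problems else 0
```

An executable hooks/pre-receive script could end with `sys.exit(main(sys.stdin))`. Note that `git annex sync` normally also pushes `synced/` branches, so the allowed set may need widening; that choice is left open here.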

demo shard

This git-annex repository contains 100k files, the entire collections "internetarchivebooks" and "usenethistorical".

git clone SHARD1@124.6.40.227:shard1

You need to get your .ssh/id_rsa.pub added to be able to access this. Ask on IRC for now.

To play with this, just git clone it, and then you can run `git annex get` to download files, or `git annex whereis` to show where git-annex knows a file is.

After you have downloaded some files, let the central repo know you're backing them up by running: `git annex sync`

`git annex info` shows some stats about the files this repository is tracking:

annexed files in working tree: 103343

size of annexed files in working tree: 2.91 terabytes

The size of the git repository itself is 51 megabytes.

Note that because the IA census uses md5sums, you need git-annex version 5.20150205 or newer to run git-annex fsck in this repository.

Older versions of git-annex will work for everything else, but not fsck.

One way to install that version is https://git-annex.branchable.com/install/Linux_standalone/

git-annex scalability tests

git-annex repo growth test

I made a test repo with 10000 files, added via git annex. After git gc --aggressive, .git/objects/ was 4.3M.

I wanted to see how having a fair number of clients each storing part of that and communicating back what they were doing would scale. So, I made 100 clones of the initial repo, each representing a client.

Then in each client, I picked 300 files at random to download. This means that on average, each file would end up replicated to 3 clients. I ran the downloads one client at a time, so as to not overload my laptop.
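
The "3 clients on average" figure follows from 100 clients × 300 files / 10000 files. The sketch below reproduces that arithmetic and also simulates the random picks, which shows that an average of 3 still leaves some files with no copy at all: a given file is skipped by one client with probability 1 − 300/10000 = 0.97, so by all 100 clients with probability 0.97^100, a little under 5%. The function names and the fixed seed are illustrative assumptions; this is not the original test script.

```python
import random
from collections import Counter


def expected_copies(num_files, num_clients, files_per_client):
    """Average number of clients holding each file."""
    return num_clients * files_per_client / num_files


def prob_uncovered(num_files, num_clients, files_per_client):
    """Chance that a given file is picked by no client at all."""
    return (1 - files_per_client / num_files) ** num_clients


def simulate_copies(num_files, num_clients, files_per_client, seed=0):
    """Return a list with the number of copies of each file."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_clients):
        # Each client picks distinct files uniformly at random.
        counts.update(rng.sample(range(num_files), files_per_client))
    return [counts[i] for i in range(num_files)]


if __name__ == "__main__":
    copies = simulate_copies(10000, 100, 300)
    print("average copies:", sum(copies) / len(copies))
    print("files with no copy:", copies.count(0))
    print("predicted fraction with no copy:", prob_uncovered(10000, 100, 300))
```

For a real backup this suggests that clients should prefer under-replicated files rather than pick purely at random.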

Then I had each client sync its git-annex state back up with the origin repo. (Again sequentially.) After this sync, the size of the git objects grew to 24M; git gc --aggressive reduced it to 18M.

Next, I wanted to simulate the maintenance stage, where clients are doing fsck every month and reporting back about the files they still have. I dummied up the data that would be generated by such a fsck, and ran it in each client (just set location log for each present file to 1). After syncing back to the origin repo, and git gc --aggressive, the size of the git objects grew to 19M, so about 1M of growth per month.

Summary: Not much to worry about here. Note that if, after several years, the git-annex info in the repo got too big, git-annex forget can be used to forget old history, and drop it back down to starting levels. This leaves plenty of room to grow, either to 100k files or to 1000 clients. And this is just simulating one shard, of thousands.
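
To put the measured numbers in perspective, here is a small projection sketch. It assumes growth stays linear at the measured rate (18M after the initial sync, roughly 1M more per monthly fsck round), which the test above does not itself establish; the function name and defaults are illustrative.

```python
def projected_size_mb(months, baseline_mb=18.0, growth_mb_per_month=1.0):
    """Projected size of .git/objects after the given number of months.

    Assumption: linear growth at the rate seen in the single simulated
    fsck round (100 clients, 10000 files).
    """
    if months < 0:
        raise ValueError("months must be non-negative")
    return baseline_mb + growth_mb_per_month * months


if __name__ == "__main__":
    for years in (1, 3, 5):
        print(years, "years:", projected_size_mb(12 * years), "MB")
```

Under that assumption the test repo would reach 78 MB after five years of monthly reports, which is still small for a git repository.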

Script: http://tmp.kitenet.net/git-annex-growth-test.sh

Difference between revisions of "INTERNETARCHIVE.BAK/git-annex implementation"