INTERNETARCHIVE.BAK/git-annex implementation

From Archiveteam

This page addresses a git-annex implementation of INTERNETARCHIVE.BAK.

For more information, see

First tasks

Some first steps to work on:

  • Get a list of files, checksums, and urls. (done)
  • Write a script to generate a git-annex repository with 100k files from the list. (done)
  • Set up a server to serve up the git repos. Any linux system with a few hundred gb of disk and ssh and git-annex installed will do. It needs to accept incoming ssh connections from registered clients, only letting them run git-annex-shell. (done)
  • Put one shard repo on the server to start. (done)
  • Manually register a few clients to start, have them manually download some files, and `git annex sync` their state back to the server. See how it all hangs together. (in progress)
  • Get that first shard backed up enough to be able to say, "we have successfully backed up 1/1770th of the IA!"
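
One way the server could limit registered clients to git-annex-shell is a forced command in the account's authorized_keys file, along the lines of the example in the git-annex-shell documentation. The Python sketch below only builds such a line. The function name, the repository path, and the example key are hypothetical, and the particular set of ssh restriction options is an assumption rather than a description of the actual server setup.

```python
import shlex

# OpenSSH authorized_keys options that disable forwarding and terminals.
RESTRICTIONS = "no-agent-forwarding,no-port-forwarding,no-X11-forwarding,no-pty"


def authorized_keys_line(public_key: str, repo_dir: str) -> str:
    """Build an authorized_keys entry that forces git-annex-shell.

    GIT_ANNEX_SHELL_LIMITED and GIT_ANNEX_SHELL_DIRECTORY are environment
    variables read by git-annex-shell; repo_dir is a hypothetical path to
    the shard repository on the server.
    """
    forced = (
        "GIT_ANNEX_SHELL_LIMITED=true "
        f"GIT_ANNEX_SHELL_DIRECTORY={shlex.quote(repo_dir)} "
        'git-annex-shell -c "$SSH_ORIGINAL_COMMAND"'
    )
    # Double quotes inside command="..." have to be backslash-escaped.
    escaped = forced.replace('"', '\\"')
    return f'command="{escaped}",{RESTRICTIONS} {public_key.strip()}'
```

Calling `authorized_keys_line("ssh-ed25519 AAAAC3 client1", "/srv/shard1")` with a placeholder key yields one line that could be appended to the server account's authorized_keys file when a client is registered.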

Middle tasks

  • Test a restore of that first shard. Tell git-annex the content is no longer in the IA. Get the clients to upload it to our server.

Later tasks

  • Create all 1770 shards, and see how that scales.
  • Write pre-receive git hook, to reject pushes of branches other than the git-annex branch, and probably do other checks for bad/malicious pushes.
  • Write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)
  • Client runtime environment (docker image maybe?) with warrior-like interface (all that needs to do is configure things and get git-annex running)
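
The pre-receive hook mentioned above could start out as something like the following Python sketch. It only covers the branch check, not the other checks for bad or malicious pushes. The names `ALLOWED_REFS`, `rejected_refs`, and `main` are hypothetical, and allowing only `refs/heads/git-annex` is an assumption taken from the task description; a real deployment may need to allow other refs that `git annex sync` pushes.

```python
import sys

# Assumption: only the git-annex branch may be updated by clients.
ALLOWED_REFS = frozenset({"refs/heads/git-annex"})


def rejected_refs(hook_input, allowed=ALLOWED_REFS):
    """Return the refs in pre-receive input that are not allowed.

    Git passes a pre-receive hook one line per updated ref on stdin,
    in the form "<old-sha> <new-sha> <refname>".
    """
    bad = []
    for line in hook_input:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 3:
            bad.append(line.strip())  # malformed input is rejected too
        elif parts[2] not in allowed:
            bad.append(parts[2])
    return bad


def main(stream=sys.stdin):
    """Return the hook's exit status; non-zero makes git refuse the push."""
    bad = rejected_refs(stream)
    for ref in bad:
        print(f"rejected push to {ref}", file=sys.stderr)
    return 1 if bad else 0
```

To use this as a hook, the script would be installed as hooks/pre-receive in the server repository, made executable, and would end by calling `sys.exit(main())`.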

demo shard

This git-annex repository contains 100k files, the entire collections "internetarchivebooks" and "usenethistorical".

git clone SHARD1@

You need to get your .ssh/ added to be able to access this. Ask on IRC for now.

To play with this, just git clone it, and then you can run `git annex get` to download files, or `git annex whereis` to show where git-annex knows a file is.

After you have downloaded some files, let the central repo know you're backing them up by running: `git annex sync`

`git annex info` (called `git annex status` in git-annex versions before 5) shows some stats about the files this repository is tracking:

annexed files in working tree: 103343

size of annexed files in working tree: 2.91 terabytes

The size of the git repository itself is 51 megabytes.

Note that due to the IA census using md5sums, you need git-annex version 5.20150205 to run git-annex fsck in this repository.

Older versions of git-annex will work for everything else, but not fsck.

One way to install that version is

git-annex scalability tests

git-annex repo growth test

I made a test repo with 10000 files, added via git annex. After git gc --aggressive, .git/objects/ was 4.3M.

I wanted to see how having a fair number of clients each storing part of that and communicating back what they were doing would scale. So, I made 100 clones of the initial repo, each representing a client.

Then in each client, I picked 300 files at random to download. This means that on average, each file would end up replicated to 3 clients. I ran the downloads one client at a time, so as to not overload my laptop.
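
The replication figure above follows from simple arithmetic, sketched here in Python. The function names are hypothetical, and the second function assumes each client picks its 300 files uniformly at random and independently of the other clients, which the test description implies but does not state exactly.

```python
def expected_copies(clients: int, files_per_client: int, total_files: int) -> float:
    """Mean number of clients holding each file."""
    return clients * files_per_client / total_files


def prob_no_copy(clients: int, files_per_client: int, total_files: int) -> float:
    """Chance that a given file is picked by none of the clients."""
    return (1 - files_per_client / total_files) ** clients


# 100 clients, 300 files each, 10000 files in the test repo.
average = expected_copies(100, 300, 10000)
unlucky = prob_no_copy(100, 300, 10000)
```

With the numbers from this test, `average` is 3. Under the uniform-pick assumption, `unlucky` comes out at a little under 5%, so random selection alone leaves some files with no copy even though the average is 3.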

Then I had each client sync its git-annex state back up with the origin repo (again sequentially). After this sync, the size of the git objects grew to 24M; git gc --aggressive reduced it to 18M.

Next, I wanted to simulate the maintenance stage, where clients run fsck every month and report back about the files they still have. I dummied up the data that such a fsck would generate, and ran it in each client (just set the location log for each present file to 1). After syncing back to the origin repo and running git gc --aggressive, the size of the git objects had grown from 18M to 19M, so about 1M of growth per month.
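
To get a feel for what that growth rate means over time, here is a small Python sketch. It extrapolates linearly from the single simulated month, which is an assumption; the test did not measure more than one round of fsck. The function name is hypothetical.

```python
def projected_size_mb(start_mb: float, growth_mb_per_month: float, months: int) -> float:
    """Linear projection of the git object size after a number of months."""
    return start_mb + growth_mb_per_month * months


# Starting from the 18M measured after the initial sync, at 1M per month.
after_one_year = projected_size_mb(18, 1, 12)
after_five_years = projected_size_mb(18, 1, 60)
```

Under this assumption the test repo would sit at 30M after a year and 78M after five years, before any use of git-annex forget.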

Summary: Not much to worry about here. Note that if, after several years, the git-annex info in the repo got too big, git-annex forget can be used to forget old history, and drop it back down to starting levels. This leaves plenty of room to grow, either to 100k files or to 1000 clients. And this is just simulating one shard, of thousands.