For more information, see http://git-annex.branchable.com/design/iabackup/.
Some first steps to work on:
- get a list of files, checksums, and urls (done)
- write a script to generate a git-annex repository with 100k files from the list (done; see the sketch after this list)
- write a client registration backend, which generates the client's ssh private key and git-annex UUID and sends them to the client (somehow tied to IA library cards?)
- client runtime environment (docker image maybe?) with a warrior-like interface (all it needs to do is configure things and get git-annex running)
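As a rough illustration of the repository-generation step, here is a minimal sketch. It assumes the list is a tab-separated file of filename, SHA256 checksum, size, and URL (the actual list format is an assumption, and `filelist.tsv` is a hypothetical name), and it assumes a git-annex recent enough to have `registerurl`:

```sh
#!/bin/bash
# Sketch: build a git-annex repo from a list of
# "filename<TAB>sha256<TAB>size<TAB>url" lines. filelist.tsv is hypothetical.
set -e
git init repo && cd repo
git annex init "IA backup share"
while IFS=$'\t' read -r file sha size url; do
    key="SHA256-s$size--$sha"
    git annex registerurl "$key" "$url"  # record where the content can be downloaded
    git annex fromkey "$key" "$file"     # link the file to its key, no content needed
done < ../filelist.tsv
git commit -m "add files from list"
```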
## git-annex scalability tests

### git-annex repo growth test
I made a test repo with 10000 files, added via git annex. After git gc --aggressive, .git/objects/ was 4.3M.
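Such a test repo can be generated with something like the following sketch (the file contents here are arbitrary; any small unique contents would do):

```sh
# Sketch: a test repo with 10000 small annexed files.
git init origin && cd origin
git annex init "origin"
for i in $(seq 1 10000); do
    echo "file $i" > "file$i"
done
git annex add .
git commit -m "add 10000 files"
git gc --aggressive
du -sh .git/objects    # 4.3M in the test above
```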
I wanted to see how well this would scale once a fair number of clients were each storing part of the repo and communicating back what they were doing. So, I made 100 clones of the initial repo, each representing a client.
Then in each client, I picked 300 files at random to download. This means that on average, each file would end up replicated to 3 clients. I ran the downloads one client at a time, so as to not overload my laptop.
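A sketch of the clone-and-download loop (the client names and the use of shuf to pick the random files are illustrative choices, not part of the original test):

```sh
# Sketch: 100 clients, each downloading 300 randomly chosen files,
# run one client at a time.
for n in $(seq 1 100); do
    git clone origin "client$n"
    (
        cd "client$n"
        git annex init "client$n"
        git ls-files | shuf -n 300 | xargs git annex get
    )
done
```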
Then I had each client sync its git-annex state back up with the origin repo. (Again sequentially.) After this sync, the size of the git objects grew to 24M; git gc --aggressive reduced it to 18M.
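The sync step amounts to running git annex sync in each client, which pushes the client's git-annex branch (location tracking) to synced/* branches on origin; roughly:

```sh
# Sketch: each client reports its state back to origin, sequentially.
for n in $(seq 1 100); do
    (cd "client$n" && git annex sync origin)
done
cd origin
git annex merge            # fold the pushed git-annex branches in
du -sh .git/objects        # ~24M in the test
git gc --aggressive
du -sh .git/objects        # ~18M after repacking
```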
Next, I wanted to simulate the maintenance stage, where clients run fsck every month and report back about the files they still have. I dummied up the data that such a fsck would generate and ran it in each client (just setting the location log for each present file to 1). After syncing back to the origin repo and running git gc --aggressive, the size of the git objects grew to 19M, so roughly 1 MB of growth per month.
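In a real deployment the monthly cycle per client would look roughly like this (the test dummied up the fsck results rather than running it):

```sh
# Sketch: monthly client maintenance.
git annex fsck --fast      # re-verify which files are present, updating location logs
git annex sync origin      # report the results back upstream
```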
Summary: Not much to worry about here. Note that if, after several years, the git-annex info in the repo gets too big, git-annex forget can be used to forget old history and drop it back down to starting levels. This leaves plenty of room to grow, either to 100k files or to 1000 clients. And this simulates just one share, out of thousands.
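For reference, trimming the history would look like:

```sh
# Sketch: drop old git-annex branch history once it gets too large.
git annex forget           # rewrite the git-annex branch without old history
git annex sync             # propagate the rewritten branch to remotes
```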