INTERNETARCHIVE.BAK/git-annex implementation

This page addresses a git-annex implementation of INTERNETARCHIVE.BAK.

For more information, see http://git-annex.branchable.com/design/iabackup/.
= Some quick info on Internet Archive =
== Data model ==
IA's data is organized into ''collections'' and ''items''. One collection contains many items, and an item contains files of the same type, such as the MP3 files of an album or a single ISO image. (A PDF manual and the software it documents should go in separate items.)
Here's an example collection, and an item in that collection: https://archive.org/details/archiveteam-fire, https://archive.org/details/proust-panic-download-warc.
== Browsing the Internet Archive ==
In addition to the web interface, you can use the [https://pypi.python.org/pypi/internetarchive Internet Archive command-line tool].  The tool currently requires a Python 2.x installation.  If you use pip, run
<pre>
pip install internetarchive
</pre>
See https://pypi.python.org/pypi/internetarchive#command-line-usage for usage information.  If you want to start digging, you might find it useful to issue <code>ia search 'collection:*'</code>; this will return summary information for all of IA's items.
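A quick sketch of a typical session, using the example collection and item from the data model section above (output formats may vary between versions of the tool):

<pre>
# List items in a collection (one result per line, as JSON)
ia search 'collection:archiveteam-fire'

# Show an item's metadata as JSON
ia metadata proust-panic-download-warc

# Download an item's files into ./proust-panic-download-warc/
ia download proust-panic-download-warc
</pre>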


= First tasks =

Some first steps to work on:

* Pick a set of around 10,000 items whose sizes sum to around 8 TB.
* Build a map from item to shard. This needs to scale well to 24+ million items; perhaps SQL.
* Write an ingestion script that takes an item and generates a tarball of its non-derived files. It needs to produce the same checksum each time it is run on an (unmodified) item; pristine-tar and similar tools show how to do this reproducibly (see the sketch after this list).
* Write a client registration backend, which generates the client's ssh private key and git-annex UUID and sends them to the client (somehow tied to IA library cards?).
* Build the client runtime environment (a docker image, maybe?) with a warrior-like interface (all it needs to do is configure things and get git-annex running).
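A minimal sketch of the reproducible-tarball step, assuming GNU tar 1.28+ (for <code>--sort=name</code>) and assuming the item's non-derived files have already been fetched into a directory named after the item (the script name and layout here are hypothetical):

<pre>
#!/bin/sh
# Usage: ./ingest.sh <item-identifier>
# Assumes ./<item>/ already contains the item's non-derived files.
# Produces <item>.tar.gz whose checksum is stable across runs.
item="$1"

# Deterministic tarball: fixed file order, fixed timestamps, fixed
# ownership; gzip -n omits the timestamp from the gzip header.
tar --sort=name \
    --mtime='@0' \
    --owner=0 --group=0 --numeric-owner \
    -cf - "$item" | gzip -n > "$item.tar.gz"

sha256sum "$item.tar.gz"
</pre>

Without <code>gzip -n</code>, gzip embeds the input mtime in its header, so two otherwise identical runs would produce different checksums.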

= git-annex scalability tests =

== git-annex repo growth test ==

I made a test repo with 10000 files, added via git annex. After git gc --aggressive, .git/objects/ was 4.3M.
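The full script is linked at the bottom of this page; a minimal sketch of just this setup step (repo and file names are made up) might look like:

<pre>
# Create a test repo and add 10000 small files via git-annex.
git init testrepo && cd testrepo
git annex init origin
for i in $(seq 1 10000); do echo "$i" > "file$i"; done
git annex add .
git commit -m 'add 10000 files'

# Measure the git object store after aggressive packing.
git gc --aggressive
du -sh .git/objects
</pre>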

I wanted to see how having a fair number of clients each storing part of that and communicating back what they were doing would scale. So, I made 100 clones of the initial repo, each representing a client.

Then in each client, I picked 300 files at random to download. With 100 clients each holding 300 of the 10000 files, each file ends up replicated to 3 clients on average (100 × 300 / 10000 = 3 copies). I ran the downloads one client at a time, so as to not overload my laptop.
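Per client, the random pick needs nothing fancier than standard shell tools; a sketch (the clone's directory name is made up):

<pre>
# In one client's clone: choose 300 checked-in files at random
# and fetch their content from the origin repo.
cd client42
git ls-files | shuf -n 300 | xargs -d '\n' git annex get
</pre>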

Then I had each client sync its git-annex state back up with the origin repo. (Again sequentially.) After this sync, the size of the git objects grew to 24M, gc --aggressive reduced it to 18M.
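The sync step is git-annex's normal mechanism for propagating the git-annex branch, which holds the location-tracking state; a sketch:

<pre>
# In each client: push location-tracking state back to origin.
git annex sync origin

# In the origin repo afterwards: repack and measure growth.
git gc --aggressive
du -sh .git/objects
</pre>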

Next, I wanted to simulate the maintenance stage, where clients run a fsck every month and report back about the files they still have. I dummied up the data such a fsck would generate and ran it in each client (it just sets the location log for each present file to 1, meaning present). After syncing back to the origin repo and running git gc --aggressive, the size of the git objects grew to 19M, so about 1MB of growth per month.
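In a real deployment the monthly check would be git-annex's own fsck rather than dummied-up location logs; each client would run something like:

<pre>
# Verify locally present content and refresh the location logs;
# --incremental spreads the work over multiple runs.
git annex fsck --incremental

# Report the refreshed state back to the origin repo.
git annex sync origin
</pre>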

Summary: Not much to worry about here. If, after several years, the git-annex info in the repo got too big, git-annex forget can be used to forget old history and drop it back down to starting levels. That leaves plenty of room to grow, whether to 100k files or to 1000 clients. And this simulates just one share, out of thousands.
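For reference, the history pruning mentioned above is built into git-annex; a sketch (the pruned branch propagates to other clones as they sync):

<pre>
# Rewrite the git-annex branch, dropping historical location data.
git annex forget

# Optionally also prune references to repositories marked dead.
git annex forget --drop-dead
</pre>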

Script: http://tmp.kitenet.net/git-annex-growth-test.sh