INTERNETARCHIVE.BAK/git-annex implementation

From Archiveteam

This page addresses a git-annex implementation of INTERNETARCHIVE.BAK.

For more information, see

Some quick info on Internet Archive

Data model

IA's data is organized into collections and items. One collection contains many items. An item contains files of the same type, such as the MP3 files of an album or a single ISO image file. (A PDF manual and its software should go in separate items.)

Here's an example collection and an item in that collection:

Browsing the Internet Archive

In addition to the web interface, you can use the Internet Archive command-line tool. The tool currently requires a Python 2.x installation. If you use pip, run

pip install internetarchive

See for usage information. If you want to start digging, you might find it useful to issue ia search 'collection:*'; this will return summary information for all of IA's items.
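Since the search output is just one JSON document per item, a small script can slice it however you like. A minimal sketch, assuming you have saved newline-delimited JSON from ia search (the item_size field name is an assumption; check the tool's documentation for the real field names):

```python
import json

def parse_search_dump(text):
    """Yield (identifier, size-in-bytes) pairs from newline-delimited
    JSON search results, e.g. as dumped by the ia command-line tool.
    The "item_size" field is assumed here and may differ in practice."""
    for line in text.splitlines():
        if not line.strip():
            continue
        doc = json.loads(line)
        yield doc["identifier"], int(doc.get("item_size", 0))

# Two hypothetical search-result lines, for illustration only:
sample = "\n".join([
    '{"identifier": "example-item-1", "item_size": 1048576}',
    '{"identifier": "example-item-2", "item_size": 2097152}',
])

for ident, size in parse_search_dump(sample):
    print(ident, size)
```

From here, picking a set of items whose sizes sum to a target (as in the first tasks below) is a simple accumulation loop over these pairs.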

First tasks

Some first steps to work on:

  • pick a set of around 10,000 items whose sizes sum to around 8 TB
  • build a map from item to shard. It needs to scale to 24+ million items. SQL?
  • write an ingestion script that takes an item and generates a tarball of its non-derived files. It needs to produce the same checksum each time it is run on an (unmodified) item; pristine-tar and similar tools show how to do this reproducibly.
  • write a client registration backend, which generates the client's SSH private key and git-annex UUID and sends them to the client (somehow tied to IA library cards?)
  • build the client runtime environment (a Docker image, maybe?) with a warrior-like interface (all it needs to do is configure things and get git-annex running)
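Two of these tasks can be sketched quickly. A stable hash makes the item-to-shard map deterministic without a 24-million-row lookup table (a SQL table would allow rebalancing later, though), and IA's file metadata already distinguishes original from derived files. Function names and the shard count below are illustrative, not decided:

```python
import hashlib

def shard_for_item(identifier, num_shards=2400):
    """Deterministically map an IA item identifier to a shard number.
    num_shards is a made-up placeholder; sizing shards to ~8 TB each
    would fix the real value."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def original_files(files):
    """Keep only non-derived files from an item's files list.
    IA's metadata marks generated files with source == "derivative"."""
    return [f for f in files if f.get("source") == "original"]

print(shard_for_item("example-item"))
```

Hash-based sharding also means the ingestion script and clients can compute an item's shard independently, with no coordination needed.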

git-annex scalability tests

git-annex repo growth test

I made a test repo with 10000 files, added via git annex. After git gc --aggressive, .git/objects/ was 4.3M.

I wanted to see how having a fair number of clients, each storing part of that data and communicating back what they were doing, would scale. So I made 100 clones of the initial repo, each representing a client.

Then in each client, I picked 300 files at random to download. This means that, on average, each file would end up replicated to 3 clients. (Some files ended up with more copies, some with fewer.)
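The expected replication factor falls out directly: 100 clients × 300 files / 10,000 files = 3 copies per file on average. A quick simulation (illustrative only, not the actual test harness) shows the spread around that average:

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
NUM_FILES, NUM_CLIENTS, PER_CLIENT = 10_000, 100, 300

copies = [0] * NUM_FILES
for _ in range(NUM_CLIENTS):
    # each client downloads 300 distinct files chosen at random
    for f in random.sample(range(NUM_FILES), PER_CLIENT):
        copies[f] += 1

print("mean copies per file:", sum(copies) / NUM_FILES)  # exactly 3.0 by construction
print("files with zero copies:", copies.count(0))
print("max copies of one file:", max(copies))
```

Since each file is picked independently with these parameters, the copy counts are roughly Poisson-distributed around 3, so a few percent of files end up with no copies at all — which is why a real deployment would need to steer clients toward under-replicated files.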

I ran the downloads one client at a time, so as to not overload my laptop.

Then, the interesting bit: I had each client sync its git-annex state back up with the origin repo. (Again, sequentially.)

After this sync, the size of the git objects grew to 24M; git gc --aggressive reduced it to 17M.

Next, I wanted to simulate the maintenance stage, where clients do an incremental fsck every month and report back about the files they still have.

I dummied up the data that such a fsck would generate and ran it in each client (just setting the location log for each present file to 1).

After syncing back to the origin repo and running git gc --aggressive, the size of the git objects grew to 18M, so about 1 MB of growth per month.

Summary: not much to worry about here. Note that if, after several years, the git-annex info in the repo got too big, git-annex forget can be used to forget old history and drop the size back down to starting levels. This leaves plenty of room to grow, whether to 100,000 files or to 1,000 clients. And this is just simulating one shard, out of thousands.