For more information, see http://git-annex.branchable.com/design/iabackup/.
Some quick info on Internet Archive
IA's data is organized into collections and items. A collection contains many items. An item contains files of the same type, such as the MP3 files of one album or a single ISO image file. (A PDF manual and the software it documents should go in separate items.)
Here's an example collection and item in that collection: https://archive.org/details/archiveteam-fire, https://archive.org/details/proust-panic-download-warc.
Browsing the Internet Archive
In addition to the web interface, you can use the Internet Archive command-line tool. The tool currently requires a Python 2.x installation. If you use pip, run
pip install internetarchive
See https://pypi.python.org/pypi/internetarchive#command-line-usage for usage information. If you want to start digging, you might find it useful to run
ia search 'collection:*'
This returns summary information for all of IA's items.
Some first steps to work on:
- pick a set of around 10 thousand items whose size sums to around 8 TB
- build a map from item to shard. It needs to scale well to 24+ million items. SQL?
- write an ingestion script that takes an item and generates a tarball of its non-derived files. It needs to reproduce the same checksum each time it is run on an (unmodified) item. Note that pristine-tar and similar tools show how to do this reproducibly.
- write a client registration backend, which generates the client's SSH private key and git-annex UUID and sends them to the client (somehow tied to IA library cards?)
- client runtime environment (docker image maybe?) with a warrior-like interface (all it needs to do is configure things and get git-annex running)
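As a rough illustration of the shard map and ingestion steps above, here is a minimal Python sketch using only the standard library. It is not a settled design. The function names (assign_shards, store_shard_map, reproducible_tar), the item_shard table layout, and the greedy packing rule are all assumptions made for this example. The tarball step takes file contents already in memory and leaves out fetching an item's non-derived files from IA. Reproducibility comes from sorting member names and pinning every metadata field (mtime, owner, mode), so the same input always gives the same bytes and therefore the same checksum.

```python
import hashlib
import io
import sqlite3
import tarfile


def assign_shards(items, shard_bytes):
    """Greedy packing (an assumption, not a decided policy).

    items: iterable of (identifier, size_in_bytes) pairs.
    A new shard is started when the next item would push the
    current shard past shard_bytes.
    Returns a list of (identifier, shard_number) pairs.
    """
    shard, used, out = 0, 0, []
    for identifier, size in items:
        if used and used + size > shard_bytes:
            shard, used = shard + 1, 0
        out.append((identifier, shard))
        used += size
    return out


def store_shard_map(conn, assignments):
    """Persist the item -> shard map in a hypothetical item_shard table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS item_shard ("
        "item TEXT PRIMARY KEY, shard INTEGER NOT NULL)"
    )
    # Index on shard so "all items in shard N" stays fast at scale.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_item_shard_shard ON item_shard(shard)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO item_shard VALUES (?, ?)", assignments
    )
    conn.commit()


def reproducible_tar(files):
    """Build an uncompressed tarball from a dict of name -> bytes.

    Members are added in sorted order with fixed metadata, so the
    output depends only on the names and contents.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for name in sorted(files):
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = 0          # no timestamps
            info.mode = 0o644       # fixed permissions
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up identifiers and sizes, for illustration only.
    pairs = assign_shards([("item-a", 5), ("item-b", 4), ("item-c", 3)], 10)
    conn = sqlite3.connect(":memory:")
    store_shard_map(conn, pairs)
    print(conn.execute("SELECT item, shard FROM item_shard").fetchall())

    tarball = reproducible_tar({"b.txt": b"two", "a.txt": b"one"})
    print(hashlib.sha256(tarball).hexdigest())
```

SQLite is used here only because it ships with Python. Whether it holds up at 24+ million rows under the real access pattern would need testing. If the tarball is compressed afterwards, the compressor also has to be deterministic (gzip, for instance, embeds a timestamp by default).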