Difference between revisions of "INTERNETARCHIVE.BAK/git-annex implementation"
Revision as of 23:41, 4 March 2015
This page addresses a git-annex implementation of INTERNETARCHIVE.BAK.
For more information, see http://git-annex.branchable.com/design/iabackup/.
Some quick info on Internet Archive
Data model
IA's data is organized into collections and items. One collection contains many items.
Here's an example collection, https://archive.org/details/archiveteam-fire, and an item in that collection, https://archive.org/details/proust-panic-download-warc.
Browsing the Internet Archive
In addition to the web interface, you can use the Internet Archive command-line tool. The tool currently requires a Python 2.x installation. If you use pip, run
pip install internetarchive
From there, you can run ia search 'collection:*' to get information on all collections as a JSON array. (It's a big list.) See https://pypi.python.org/pypi/internetarchive#command-line-usage for more information.
First tasks
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)
<closure> - client runtime environment (docker image maybe?) with warrior-like interface
<closure> (all that needs to do is configure things and get git-annex running)
<closure> could someone wiki that? ta
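The ingestion-script task asks for a tarball whose checksum is the same on every run. Here is a minimal sketch of one way to get that in Python 3. It is not closure's method, which the log does not describe. It assumes the item has already been downloaded to a local directory, and that the caller already knows which files are derived and passes their relative paths in. The function name build_reproducible_tarball and the skip parameter are made up for illustration. The idea is to sort the member list and pin every field that normally varies between runs (timestamps, owner, mode, the gzip header).

```python
import gzip
import hashlib
import io
import os
import tarfile


def build_reproducible_tarball(src_dir, skip=()):
    """Return .tar.gz bytes for src_dir that do not change between runs.

    skip is a collection of relative paths (with "/" separators) to leave
    out, e.g. an item's derived files. How to identify those is left to
    the caller; this sketch does not query the Internet Archive.
    """
    # Collect relative paths and sort them, so member order never depends
    # on the order in which the filesystem lists directory entries.
    paths = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), src_dir)
            rel = rel.replace(os.sep, "/")
            if rel not in skip:
                paths.append(rel)
    paths.sort()

    raw = io.BytesIO()
    # mtime=0 pins the timestamp in the gzip header; filename="" keeps any
    # local file name out of it.
    with gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w", format=tarfile.GNU_FORMAT) as tar:
            for rel in paths:
                full = os.path.join(src_dir, rel)
                info = tarfile.TarInfo(name=rel)
                info.size = os.path.getsize(full)
                # Pin all metadata that would otherwise vary per machine or run.
                info.mtime = 0
                info.mode = 0o644
                info.uid = info.gid = 0
                info.uname = info.gname = ""
                with open(full, "rb") as fh:
                    tar.addfile(info, fh)
    return raw.getvalue()


def tarball_checksum(src_dir, skip=()):
    """SHA-256 hex digest of the reproducible tarball."""
    return hashlib.sha256(build_reproducible_tarball(src_dir, skip)).hexdigest()
```

One caveat: the compressed bytes can still differ between zlib versions or compression levels, so if the ingestion script may run on different machines, checksumming the uncompressed tar stream is probably the safer choice. This sketch also holds the whole tarball in memory, which would need to change for large items.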