Revision as of 23:36, 4 March 2015

This page addresses a git-annex implementation of INTERNETARCHIVE.BAK.

For more information, see http://git-annex.branchable.com/design/iabackup/.

Internet Archive's structure

IA's data is organized into collections and items. One collection contains many items.

Here's an example collection, https://archive.org/details/archiveteam-fire, and an item in that collection, https://archive.org/details/proust-panic-download-warc.

First tasks

<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)
<closure> - client runtime environment (docker image maybe?) with warrior-like interface
<closure> (all that needs to do is configure things and get git-annex running)
<closure> could someone wiki that? ta
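
The item-to-shard map mentioned above could be prototyped with SQLite. This is only a minimal sketch, not the project's design: the table name item_shard, the sample identifiers and sizes, and the rule of filling one shard sequentially up to a byte budget are all assumptions made for illustration. Treating the "around 8 TB" figure as a per-shard budget is likewise an assumption.

```python
import sqlite3

# Hypothetical sample data: (item identifier, total size in bytes).
# Real identifiers and sizes would have to come from IA metadata.
ITEMS = [
    ("example-item-a", 500),
    ("example-item-b", 300),
    ("example-item-c", 400),
    ("example-item-d", 200),
]


def build_shard_map(conn, items, shard_capacity):
    """Assign items to numbered shards in input order.

    A new shard is started whenever adding the next item would push the
    current shard past shard_capacity bytes.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS item_shard ("
        " item TEXT PRIMARY KEY,"
        " size INTEGER NOT NULL,"
        " shard INTEGER NOT NULL)"
    )
    # Index on shard so "list everything in shard N" stays fast at scale.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_shard ON item_shard (shard)")
    shard, used = 1, 0
    for name, size in items:
        if used > 0 and used + size > shard_capacity:
            shard += 1
            used = 0
        conn.execute(
            "INSERT INTO item_shard (item, size, shard) VALUES (?, ?, ?)",
            (name, size, shard),
        )
        used += size
    conn.commit()


def shard_of(conn, item):
    """Return the shard number for an item, or None if it is unknown."""
    row = conn.execute(
        "SELECT shard FROM item_shard WHERE item = ?", (item,)
    ).fetchone()
    return row[0] if row else None


def shard_sizes(conn):
    """Return {shard number: total bytes assigned to that shard}."""
    return dict(
        conn.execute(
            "SELECT shard, SUM(size) FROM item_shard GROUP BY shard ORDER BY shard"
        )
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # 800 bytes stands in for the real per-shard budget.
    build_shard_map(conn, ITEMS, shard_capacity=800)
    print(shard_of(conn, "example-item-c"))
    print(shard_sizes(conn))
```

The primary key on item makes the item-to-shard lookup an indexed query, which is the part that has to scale to 24+ million rows. Whether SQLite or another SQL database is the right choice is left open here, as it is in the chat log.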

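The ingestion task above asks for a tarball whose checksum is the same on every run. One possible way to get there in Python is sketched below. It is not necessarily the method closure has in mind. The function names reproducible_tarball and sha256_hex are hypothetical, and the caller is assumed to have already worked out which files are non-derived. The sketch fixes the things that usually vary between runs: member order, timestamps, ownership, permissions, and the gzip header.

```python
import gzip
import hashlib
import io
import os
import tarfile


def reproducible_tarball(src_dir, filenames):
    """Return the bytes of a .tar.gz holding the named files from src_dir.

    filenames is assumed to list only the item's non-derived files;
    working out which files those are is not shown here.
    """
    raw = io.BytesIO()
    # GNU format keeps tarfile from emitting pax extended headers.
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Sort so the member order does not depend on directory listing order.
        for name in sorted(filenames):
            path = os.path.join(src_dir, name)
            # A fresh TarInfo defaults to mtime 0, uid/gid 0, empty owner
            # names and mode 0o644, so nothing is copied from the filesystem
            # except the file size and contents.
            info = tarfile.TarInfo(name)
            info.size = os.path.getsize(path)
            with open(path, "rb") as f:
                tar.addfile(info, f)
    out = io.BytesIO()
    # mtime=0 and an empty filename keep the gzip header constant.
    with gzip.GzipFile(filename="", fileobj=out, mode="wb", mtime=0) as gz:
        gz.write(raw.getvalue())
    return out.getvalue()


def sha256_hex(data):
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()
```

Calling sha256_hex(reproducible_tarball(path, names)) twice on an unmodified item should give the same digest, even if file modification times changed in between. Two caveats: the compressed bytes can differ across zlib versions, so the checksum is only stable for a fixed toolchain, and this sketch holds the whole archive in memory, which would not suit large items.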

Difference between revisions of "INTERNETARCHIVE.BAK/git-annex implementation"
