Difference between revisions of "INTERNETARCHIVE.BAK/git-annex implementation"

m (Reverted edits by Megalanya1 (talk) to last revision by Jscott)
 
(47 intermediate revisions by 8 users not shown)
Line 1:

Older revision:

This page addresses a [https://git-annex.branchable.com git-annex] implementation of [[INTERNETARCHIVE.BAK]].

For more information, see http://git-annex.branchable.com/design/iabackup/.

= Some quick info on Internet Archive =

== Data model ==

IA's data is organized into ''collections'' and ''items''. One collection contains many items. An item contains files of the same type, such as multiple MP3 files in an album or a single ISO image file. (A PDF manual and its software should go in separate items.)

Here's an example collection and item in that collection: https://archive.org/details/archiveteam-fire, https://archive.org/details/proust-panic-download-warc.

== Browsing the Internet Archive ==

In addition to the web interface, you can use the [https://pypi.python.org/pypi/internetarchive Internet Archive command-line tool]. The tool currently requires a Python 2.x installation. If you use pip, run

<pre>
pip install internetarchive
</pre>

See https://pypi.python.org/pypi/internetarchive#command-line-usage for usage information. If you want to start digging, you might find it useful to issue <code>ia search 'collection:*'</code>; this'll return summary information for all of IA's items.

= First tasks =

Some first steps to work on:

* pick a set of around 10 thousand items whose size sums to around 8 TB
* build map from Item to shard. Needs to scale well to 24+ million. SQL?
* write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time it is run on an (unmodified) item. Note that pristine-tar etc. show how to do this reproducibly.
* write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)
* client runtime environment (docker image maybe?) with warrior-like interface (all that needs to do is configure things and get git-annex running)

= git-annex scalability tests =

== git-annex repo growth test ==

I made a test repo with 10000 files, added via git annex. After git gc --aggressive, .git/objects/ was 4.3M.

I wanted to see how having a fair number of clients each storing part of that and communicating back what they were doing would scale. So, I made 100 clones of the initial repo, each representing a client.

Then in each client, I picked 300 files at random to download. This means that on average, each file would end up replicated to 3 clients. (A few clients ended up with more, up to 600 files per client.)

I ran the downloads one client at a time, so as to not overload my laptop.

Then the interesting bit: I had each client sync its git-annex state back up with the origin repo. (Again sequentially.)

After this sync, the size of the git objects grew to 24M; gc --aggressive reduced it to 17M.

Next, I wanted to simulate the maintenance stage, where clients are doing incremental fsck every month and reporting back about the files they still have.

I dummied up the data that would be generated by such a fsck, and ran it in each client (just set location log for each present file to 1).

After syncing back to the origin repo, and git gc --aggressive, the size of the git objects grew to 18M, so 1MB per month growth.

Summary: Not much to worry about here. Note that if, after several years, the git-annex info in the repo got too big, git-annex forget can be used to forget old history, and drop it back down to starting levels. This leaves plenty of room to grow; either to 100k files, or to 1000 clients. And this is just simulating one shard, of thousands.

Latest revision:

This page addresses a [https://git-annex.branchable.com git-annex] implementation of [[INTERNETARCHIVE.BAK]].

= Quickstart =

Do this on the drive you want to use:

<pre>
$ git clone https://github.com/ArchiveTeam/IA.BAK
$ cd IA.BAK
$ ./iabak
</pre>

It will walk you through setup and starting to download files, and install a cron job (or .timer unit) to perform periodic maintenance.

It should prompt you for how much disk space to leave unused. To adjust this value later, run <code>git config annex.diskreserve 200GB</code> in each of the <code>IA.BAK/shard*</code> directories.

Configuration and maintenance information can be found in the README.md file. (Also available at https://github.com/ArchiveTeam/IA.BAK/#readme)

=== Dependencies ===
* sane UNIX environment (shell, df, perl, grep)
* git
* crontab OR systemd (NOTE: you may need to run <code>loginctl enable-linger <user></code> to make sure the job is not killed)
* <code>shuf</code> (optional - will randomize the order you download files in)

= Status =

[http://iabak.archiveteam.org/ Graphs of status]

[http://iabak.archiveteam.org/stats/ raw data]

= Implementation plan =

For more information, see http://git-annex.branchable.com/design/iabackup/

== First tasks ==

Some first steps to work on:

* Get a list of files, checksums, and URLs. (done)
* Write a script to generate a git-annex repository with 100k files from the list. (done)
* Set up a server to serve up the git repos. Any Linux system with a few hundred GB of disk and ssh and git-annex installed will do. It needs to accept incoming ssh connections from registered clients, only letting them run git-annex-shell. (done)
* Put one shard repo on the server to start. (done)
* Manually register a few clients to start, have them manually download some files, and `git annex sync` their state back to the server. See how it all hangs together. (done)
* Get that first shard backed up enough to be able to say, "we have successfully backed up 1/1770th of the IA!" (done!)

== Middle tasks ==

* Get fsck and dead client expiry working (done)
* Test a restore from a shard. Tell git-annex the content is no longer in the IA. Get the clients to upload it to our server.
* Write client registration interface, which generates the client's ssh private key, git-annex UUID, and sends them to the client (done)
* Help the user get the iabak-cronjob set up.
* Email expire warnings (done)

== Later tasks ==

* Create all 1770 shards, and see how that scales.
* Write pre-receive git hook, to reject pushes of branches other than the git-annex branch (already done), and prevent bad/malicious pushes of the git-annex branch
* Client runtime environment (docker image maybe?) with warrior-like interface (all that needs to do is configure things and get git-annex running)

== SHARD1 ==

This is our first part of the IA that we want to get backed up. If we succeed, we will have backed up 1/1770th of the Internet Archive.
This git-annex repository contains 100k files, the entire collections "internetarchivebooks" and "usenethistorical".

Some stats about the files this repository is tracking:
* number of files: 103343
* total file size: 2.91 terabytes
* size of the git repository itself was 51 megabytes to start
* after filling up shard1, the git repo had grown to 196 MB
* We aimed for 4 copies of every file downloaded, but a few files got 5-8 copies made due to, e.g., races and manual downloads. We want to keep an eye on this with future shards.
* We got SHARD1 fully downloaded between April 1 and 6. It took a while to ramp up as people came in, so later shards may download faster. Also, 2/3 of SHARD2 was downloaded during this same time period.

== Admin details ==

See [[INTERNETARCHIVE.BAK/admin]].

Latest revision as of 16:32, 17 January 2017

This page addresses a git-annex implementation of INTERNETARCHIVE.BAK.

Quickstart

Do this on the drive you want to use:

$ git clone https://github.com/ArchiveTeam/IA.BAK
$ cd IA.BAK
$ ./iabak

It will walk you through setup and starting to download files, and install a cron job (or .timer unit) to perform periodic maintenance.

It should prompt you for how much disk space to leave unused. To adjust this value later, run git config annex.diskreserve 200GB in each of the IA.BAK/shard* directories.
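
Changing the reserve in every shard directory by hand gets tedious once several shards are checked out. The following is a minimal sketch of a loop that does it in one go. The function name set_diskreserve is hypothetical and not part of the iabak scripts, and it assumes the shard checkouts sit directly under the IA.BAK directory.

```shell
# set_diskreserve is a hypothetical helper, not part of the iabak scripts.
# Usage: set_diskreserve <reserve> [path-to-IA.BAK]
set_diskreserve() {
    reserve=$1
    base=${2:-.}
    for shard in "$base"/shard*; do
        # Skip anything that is not a directory (including an unmatched glob).
        [ -d "$shard" ] || continue
        # Same setting as running "git config annex.diskreserve ..." inside the shard.
        git -C "$shard" config annex.diskreserve "$reserve" || return 1
        echo "set annex.diskreserve=$reserve in $shard"
    done
}
```

For example, set_diskreserve 200GB ~/IA.BAK would apply the value from the text above to each shard found there.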

Configuration and maintenance information can be found in the README.md file. (Also available at https://github.com/ArchiveTeam/IA.BAK/#readme)

Dependencies

  • sane UNIX environment (shell, df, perl, grep)
  • git
  • crontab OR systemd (NOTE: you may need to run loginctl enable-linger <user> to make sure the job is not killed)
  • shuf (optional - will randomize the order you download files in)
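
Before running ./iabak it can be useful to confirm that the tools listed above are on the PATH. This is only a sketch: check_deps and have are hypothetical helper names, and using systemctl to detect systemd is an assumption rather than something the iabak script is known to do.

```shell
# have and check_deps are hypothetical helpers; tool names come from the list above.
have() {
    command -v "$1" >/dev/null 2>&1
}

check_deps() {
    missing=0
    for tool in sh df perl grep git; do
        if ! have "$tool"; then
            echo "missing required tool: $tool"
            missing=1
        fi
    done
    # Either scheduler is enough: crontab, or systemd (detected here via systemctl).
    if ! have crontab && ! have systemctl; then
        echo "missing scheduler: need crontab or systemd"
        missing=1
    fi
    # shuf is optional; it only randomizes the download order.
    have shuf || echo "optional tool not found: shuf"
    return "$missing"
}
```

Calling check_deps prints one line per problem and returns a non-zero status only when a required tool or scheduler is missing.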

Status

Graphs of status: http://iabak.archiveteam.org/

Raw data: http://iabak.archiveteam.org/stats/

Implementation plan

For more information, see http://git-annex.branchable.com/design/iabackup/

First tasks

Some first steps to work on:

  • Get a list of files, checksums, and URLs. (done)
  • Write a script to generate a git-annex repository with 100k files from the list. (done)
  • Set up a server to serve up the git repos. Any Linux system with a few hundred GB of disk and ssh and git-annex installed will do. It needs to accept incoming ssh connections from registered clients, only letting them run git-annex-shell. (done)
  • Put one shard repo on the server to start. (done)
  • Manually register a few clients to start, have them manually download some files, and `git annex sync` their state back to the server. See how it all hangs together. (done)
  • Get that first shard backed up enough to be able to say, "we have successfully backed up 1/1770th of the IA!" (done!)
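
One common way to let a registered client run nothing but git-annex-shell is a forced command in the server account's ~/.ssh/authorized_keys. The sketch below prints such a line. It is an assumption about how this could be done, not the project's actual registration code: authorized_keys_line is a hypothetical helper, the repository path and key in the usage note are placeholders, and the exact git-annex-shell invocation should be checked against its man page before use.

```shell
# authorized_keys_line is a hypothetical helper. It prints one line for
# ~/.ssh/authorized_keys that forces the given key to run git-annex-shell only.
# Arguments: repository directory on the server (no spaces), client public key.
authorized_keys_line() {
    repo_dir=$1
    pubkey=$2
    # sshd places the command the client asked for in SSH_ORIGINAL_COMMAND;
    # GIT_ANNEX_SHELL_DIRECTORY restricts git-annex-shell to this one repository.
    printf 'command="GIT_ANNEX_SHELL_DIRECTORY=%s git-annex-shell -c \\"$SSH_ORIGINAL_COMMAND\\"",' "$repo_dir"
    # Standard sshd key options that switch off everything except the forced command.
    printf 'no-agent-forwarding,no-port-forwarding,no-X11-forwarding,no-pty %s\n' "$pubkey"
}
```

For example, authorized_keys_line /srv/shard1 "$(cat client1.pub)" would emit a line that could be appended to authorized_keys for a client whose key is in client1.pub.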

Middle tasks

  • Get fsck and dead client expiry working (done)
  • Test a restore from a shard. Tell git-annex the content is no longer in the IA. Get the clients to upload it to our server.
  • Write client registration interface, which generates the client's ssh private key, git-annex UUID, and sends them to the client (done)
  • Help the user get the iabak-cronjob set up.
  • Email expire warnings (done)

Later tasks

  • Create all 1770 shards, and see how that scales.
  • Write pre-receive git hook, to reject pushes of branches other than the git-annex branch (already done), and prevent bad/malicious pushes of the git-annex branch
  • Client runtime environment (docker image maybe?) with warrior-like interface (all that needs to do is configure things and get git-annex running)
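
The pre-receive check described in the list above can be sketched as follows. Git passes a pre-receive hook one line per pushed ref on standard input, in the form "old-sha new-sha refname". This is not the hook IA.BAK actually deploys: ALLOWED_REF and check_pushed_refs are illustrative names, and rejecting deletion of the branch is just one example of a bad push.

```shell
# Illustrative names only; this is not the hook used by IA.BAK.
ALLOWED_REF="refs/heads/git-annex"

# Reads "<old-sha> <new-sha> <refname>" lines from stdin, as git supplies
# them to a pre-receive hook, and returns non-zero if any push must be refused.
check_pushed_refs() {
    status=0
    while read -r old new ref; do
        if [ "$ref" != "$ALLOWED_REF" ]; then
            echo "rejected: $ref (only $ALLOWED_REF may be pushed)" >&2
            status=1
            continue
        fi
        # An all-zero new id means the client is trying to delete the branch.
        case $new in
            *[!0]*) ;;
            *) echo "rejected: deleting $ref is not allowed" >&2
               status=1 ;;
        esac
    done
    return "$status"
}
```

In a hooks/pre-receive script the body would then be check_pushed_refs || exit 1. Since git annex sync normally also pushes a synced/git-annex branch, a real hook may need to accept that ref as well, depending on how clients push.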

SHARD1

This is our first part of the IA that we want to get backed up. If we succeed, we will have backed up 1/1770th of the Internet Archive. This git-annex repository contains 100k files, the entire collections "internetarchivebooks" and "usenethistorical".

Some stats about the files this repository is tracking:

  • number of files: 103343
  • total file size: 2.91 terabytes
  • size of the git repository itself was 51 megabytes to start
  • after filling up shard1, the git repo had grown to 196 MB
  • We aimed for 4 copies of every file downloaded, but a few files got 5-8 copies made due to, e.g., races and manual downloads. We want to keep an eye on this with future shards.
  • We got SHARD1 fully downloaded between April 1 and 6. It took a while to ramp up as people came in, so later shards may download faster. Also, 2/3 of SHARD2 was downloaded during this same time period.
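
A few derived figures follow directly from the stats above. The sketch below computes them with awk. shard_stats is a hypothetical helper, and it assumes decimal units (1 terabyte = 10^12 bytes, 1 megabyte = 10^6 bytes), which the page does not state.

```shell
# shard_stats is a hypothetical helper; decimal units are an assumption.
# Arguments: file count, total size in TB, git repo size in MB before and
# after the shard was filled, and the target number of copies per file.
shard_stats() {
    awk -v files="$1" -v tb="$2" -v start_mb="$3" -v end_mb="$4" -v copies="$5" '
    BEGIN {
        growth_mb = end_mb - start_mb
        printf "average file size: %.1f MB\n", tb * 1e6 / files
        printf "storage for %d copies: %.2f TB\n", copies, tb * copies
        printf "git metadata growth: %d MB\n", growth_mb
        printf "metadata growth per file: %.0f bytes\n", growth_mb * 1e6 / files
    }'
}
```

Running shard_stats 103343 2.91 51 196 4 with the SHARD1 numbers gives an average file size of roughly 28 MB, about 11.6 TB of client storage for four copies, and 145 MB of git metadata growth, or around 1.4 kB per tracked file.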

Admin details

See INTERNETARCHIVE.BAK/admin.