INTERNETARCHIVE.BAK/admin

From Archiveteam

Server

The iabak.archiveteam.org server is provided by Kenshin. Closure and db48x are root. db48x set up graphite and some of the web page, and closure set up most of the rest.

Configuration management

Since we want to be able to scale to multiple server instances when the time comes, most (but possibly not all) of the server's configuration is handled by a configuration management system. Closure set this up using propellor (https://propellor.branchable.com), which he also wrote.

To see the configuration, clone his propellor git repository, check out the joeyconfig branch, and look at the joeyconfig.hs and src/Propellor/Property/SiteSpecific/IABak.hs files.

Server scripts

The iabak git repository (https://github.com/ArchiveTeam/IA.BAK) has a "server" branch which contains scripts used on the server. This repository is checked out on the server in /usr/local/IA.BAK/

This is where the code lives for things like updating the stats on the web page, handling new user registration, and sending shard expiry warning emails.

Creating new shards

This assumes you have an account on the server. We're looking for SHARDMASTERS, so step right up..!

Creating a new shard is a four-step process:

  • Pick collections to include in the shard.
  • Create a shard git repository containing all the files in the items in those collections.
  • Install the shard git repository on the server.
  • Update the repolist to include the repository, so clients will begin using it.

Pick collections

Pick some collections that need to be backed up, and have not been backed up before.

We don't have any kind of a list or index of already included collections -- yet -- so make sure existing shards don't already include your collections. On the iabak server, /srv/shard/ contains clones of all the shards git repos, so you can look around in there to check.

For scalability reasons, shards should not have more than around 100,000 files in them. It also tends to work best when the total size of the files in a shard is in the 1-4 TB range. It takes some guesswork to pick a combination of collections that meets these targets, so it might take a few tries.

Closure generated some candidate shards, which all have a reasonable number of collections in them, but the collections are machine-selected. These are in /var/www/html/candidateshards/ on the server. (If you use one of these lists to create a shard, delete it afterwards to avoid duplicates.)
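As a rough aid when judging a candidate combination, the targets above can be written down as a small shell check. This is only a sketch: `check_shard_targets` is a hypothetical helper, not part of the IA.BAK repository. It assumes you already know the file count and total size in bytes, treats "around 100,000" as a hard limit, and takes 1 TB to mean 10^12 bytes.

```shell
#!/bin/sh
# Hypothetical helper (not part of IA.BAK): compare a shard's file count
# and total size against the rough targets from the text.
# Usage: check_shard_targets FILE_COUNT TOTAL_BYTES
check_shard_targets() {
    files=$1
    bytes=$2
    tb=1000000000000   # assumption: 1 TB = 10^12 bytes

    if [ "$files" -gt 100000 ]; then
        # "around 100,000" is treated as a hard limit in this sketch
        echo "too many files"
    elif [ "$bytes" -lt "$tb" ]; then
        echo "too small"
    elif [ "$bytes" -gt $((4 * tb)) ]; then
        echo "too big"
    else
        echo "ok"
    fi
}
```

For example, `check_shard_targets 90000 2000000000000` reports `ok`, while a shard with 150,000 files is reported as having too many files whatever its size.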

Create shard git repository

For this, you will need a clone of the iabak git repository, with the server branch checked out.

You also need to add the file md5_collection_url.txt.pick1.sorted.uniq to the same directory. This is a 21 GB file from the Internet Archive's census. There's a copy on the server at /home/joey/IA.BAK/md5_collection_url.txt.pick1.sorted.uniq, which you can symlink instead of copying.

   git clone git@github.com:ArchiveTeam/IA.BAK.git
   cd IA.BAK
   git checkout server
   ln -s /home/joey/IA.BAK/md5_collection_url.txt.pick1.sorted.uniq

Now you can run ./mkSHARD to create a shard. It takes two parameters. The first is either a file in the format used for the files in /var/www/html/candidateshards/, or simply a list of collection names. The second is the number of the shard.

So, for example:

   ./mkSHARD "millionbooks pimslibrary Princeton" 12

Or:

   ./mkSHARD /var/www/html/candidateshards/smallestfirst83.lst 12

It will take a while! Eventually, you'll get a SHARDn.git repository created in the current directory. The `git annex info` of the repository will also be displayed. Pay attention to the total size of the shard, and the number of files in it. If it's too big or too small, you can rm -rf the SHARDn.git and try again with a different set of collections.

Install shard git repository

Still in the same directory, run ./setupshardrepo to install the shard. Its first parameter is the full name of the shard (eg "SHARD12"), and the second parameter is the full path to the shard's git repository.

For example:

   sudo ./setupshardrepo SHARD12 `pwd`/SHARD12.git

TODO: This needs sudo access, so either SHARDMASTERs need to be given sudo access, or the script needs to be changed so that it does not require it.

This creates a user named after the shard (e.g. SHARD12), and installs the repo in that user's home directory.
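Since the shard number has to match across the mkSHARD and setupshardrepo invocations, it can help to print both commands for review before running either. The following is a minimal sketch: `print_shard_steps` is a hypothetical name, not a script in the IA.BAK repository, and it only prints text. It writes `$(pwd)` where the example above uses backticks; the two forms are equivalent.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of IA.BAK): print the mkSHARD and
# setupshardrepo commands for one shard without running them.
# Usage: print_shard_steps SHARD_NUMBER "COLLECTIONS_OR_LIST_FILE"
print_shard_steps() {
    n=$1
    source_arg=$2
    # Step 1: build the shard repository from the chosen collections.
    printf './mkSHARD "%s" %s\n' "$source_arg" "$n"
    # Step 2: install it; $(pwd) is printed literally so that it expands
    # in whatever directory the command is eventually run from.
    printf 'sudo ./setupshardrepo SHARD%s "$(pwd)/SHARD%s.git"\n' "$n" "$n"
}
```

Running `print_shard_steps 12 "millionbooks pimslibrary Princeton"` prints the two commands for shard 12, which can then be checked and pasted one at a time.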

Update repolist

Finally, time to tell clients to use the new shard. In the iabak git repo, check out the master branch, and edit the repolist file. See repolist.README for details about this file. You will probably want to add the new shard in reserve state to start with.

For example:

   shard12 SHARD12@iabak.archiveteam.org:shard12 reserve

Commit and push, and as clients update they will become aware of the shard.
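The repolist line follows a regular pattern, so it can be generated from the shard number to avoid typos. This is a sketch based only on the single example line above: `repolist_entry` is a hypothetical helper, and the authoritative description of the file format and its states is repolist.README.

```shell
#!/bin/sh
# Hypothetical helper (not part of IA.BAK): print a repolist entry for a
# shard number, following the pattern of the example line in the text.
# Usage: repolist_entry SHARD_NUMBER [STATE]
repolist_entry() {
    n=$1
    # Default to the "reserve" state suggested for new shards; check
    # repolist.README for the other states the file accepts.
    state=${2:-reserve}
    echo "shard${n} SHARD${n}@iabak.archiveteam.org:shard${n} ${state}"
}
```

For instance, `repolist_entry 12 >> repolist` appends the entry for shard 12 in reserve state, ready to be reviewed with `git diff` before committing.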

Also, the iabak website will add the shard to the display on its next update.