INTERNETARCHIVE.BAK/admin

From Archiveteam

Revision as of 01:11, 2 October 2016

Server

The iabak.archiveteam.org server is provided by Kenshin. Closure and db48x have root access. db48x set up graphite and some of the web page, and Closure set up most of the rest.

Configuration management

Since we want to be able to scale to multiple server instances when the time comes, most (but possibly not all) of the server's configuration is handled by a configuration management system. Closure set this up using propellor (https://propellor.branchable.com) which he also wrote.

Clone his propellor git repository and git checkout the joeyconfig branch, and see the joeyconfig.hs and src/Propellor/Property/SiteSpecific/IABak.hs files for the configuration.

Server scripts

The iabak git repository (https://github.com/ArchiveTeam/IA.BAK) has a "server" branch which contains scripts used on the server. This repository is checked out on the server in /usr/local/IA.BAK/

This is where the code lives for things like updating the stats on the web page, handling new user registration, and sending shard expiry warning emails.

Creating new shards

This assumes you have an account on the server. We're looking for SHARDMASTERS, so step right up..!

Creating a new shard is a four-step process:

  • Pick collections to include in the shard.
  • Create a shard git repository containing all the files in the items in those collections.
  • Install the shard git repository on the server.
  • Update the repolist to include the repository, so clients will begin using it.

Pick collections

Pick some collections that need to be backed up, and have not been backed up before.

We don't have any kind of a list or index of already included collections -- yet -- so make sure existing shards don't already include your collections. On the iabak server, /srv/shard/ contains clones of all the shard git repos, so you can look around in there to check.

For scalability reasons, shards should not have more than around 100,000 files in them. It also tends to work best for the total size of files in a shard to be in the 1-4 TB range. It takes some guesswork to pick a good combination of collections to meet these targets, and it might take a few tries.
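
The two targets above can be turned into a quick sanity check. This is only a minimal sketch and not part of the IA.BAK scripts: the function name check_shard_targets is hypothetical, it assumes you already know the file count and the total size in bytes, and it assumes 1 TB means 10^12 bytes. The text says "around 100,000", so treat the hard cutoff here as a rough guide.

```shell
# Hypothetical helper: check a shard against the rough targets in the text.
# $1 = number of files, $2 = total size in bytes.
check_shard_targets() {
    files=$1
    bytes=$2
    max_files=100000
    min_bytes=1000000000000   # 1 TB, assuming decimal terabytes
    max_bytes=4000000000000   # 4 TB
    if [ "$files" -gt "$max_files" ]; then
        echo "too many files"
    elif [ "$bytes" -lt "$min_bytes" ]; then
        echo "too small"
    elif [ "$bytes" -gt "$max_bytes" ]; then
        echo "too big"
    else
        echo "ok"
    fi
}
```

For example, `check_shard_targets 80000 2500000000000` prints `ok`.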

Closure generated some candidate shards, which all have a reasonable number of collections in them, but the collections are machine-selected. These are in /var/www/html/candidateshards/ on the server. (If you use one of these lists to create a shard, delete it afterwards to avoid dups.)

Create shard git repository

For this, you will need a clone of the iabak git repository, with the server branch checked out.

You also need to add the file md5_collection_url.txt.pick1.sorted.uniq to the same directory. This is a 21 GB file from the Internet Archive's census. There's a copy on the server in /home/joey/IA.BAK/md5_collection_url.txt.pick1.sorted.uniq.

   git clone git@github.com:ArchiveTeam/IA.BAK.git
   cd IA.BAK
   git checkout server
   ln -s /home/joey/IA.BAK/md5_collection_url.txt.pick1.sorted.uniq

Now you can run ./mkSHARD to create a shard. It takes two parameters. The first is either a file in the format used for the files in /var/www/html/candidateshards/, or a quoted list of collection names. The second is the number of the shard.

So, for example:

   ./mkSHARD "millionbooks pimslibrary Princeton" 12

Or:

   ./mkSHARD /var/www/html/candidateshards/smallestfirst83.lst 12

It will take a while! Eventually, you'll get a SHARDn.git repository created in the current directory. The `git annex info` of the repository will also be displayed. Pay attention to the total size of the shard, and the number of files in it. If it's too big/too small, you can rm -rf the SHARDn.git and try again with a different set of collections.

Install shard git repository

Still in the same directory, run ./setupshardrepo to install the shard. Its first parameter is the full name of the shard (e.g. "SHARD12"), and the second parameter is the full path to the shard's git repository.

For example:

   sudo ./setupshardrepo SHARD12 `pwd`/SHARD12.git

TODO: Obviously this needs sudo access, so something needs to be done to give SHARDMASTERs sudo access, or to make it not need that.

This creates a user named after the shard (e.g. SHARD12), and installs the repo in that user's home directory.

Update repolist

Finally, time to tell clients to use the new shard. In the iabak git repo, check out the master branch, and edit the repolist file. See repolist.README for details about this file. You will probably want to add the new shard in reserve state to start with.

For example:

   shard12 SHARD12@iabak.archiveteam.org:shard12 reserve

Commit and push, and as clients update they will become aware of the shard.

Also, the iabak website will add the shard to the display on its next update.
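
The repolist line above follows a regular pattern, so it can be generated from the shard number and state. This is a hedged sketch only: the helper name repolist_line is hypothetical, the pattern is inferred from the single example above, and the accepted states are just the three this page mentions (active, maint, reserve). repolist.README remains the authority on the format.

```shell
# Hypothetical helper: print a repolist line for shard number $1 in state $2.
# The line layout is inferred from the example on this page.
repolist_line() {
    n=$1
    state=$2
    case "$state" in
        active|maint|reserve) ;;   # states mentioned on this page
        *) echo "unknown state: $state" >&2; return 1 ;;
    esac
    echo "shard${n} SHARD${n}@iabak.archiveteam.org:shard${n} ${state}"
}
```

For example, `repolist_line 12 reserve >> repolist` would append the line shown above to the repolist file.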

Adjusting repolist states

From time to time, shards get sufficiently backed up that they no longer need to be marked as active in the repolist, and can be set to maint. Or, a shard in maint may lose redundancy, and need to go back to active to get some more clients to use it. We generally want 2-3 shards in active mode at a time, and the rest in maint, with a few new shards in reserve. See the stats on the website to know when changes need to be made. Then just edit the iabak repository's repolist file, and commit it.
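
Editing the state by hand is fine, but the change can also be scripted. The sketch below is an assumption-laden illustration, not an existing IA.BAK script: set_shard_state is a hypothetical name, and it assumes each repolist entry is three whitespace-separated fields (name, location, state) as in the example line for SHARD12. Lines whose first field does not match are passed through unchanged.

```shell
# Hypothetical helper: set the state of one shard in a repolist file.
# $1 = repolist path, $2 = shard name (first field), $3 = new state.
set_shard_state() {
    file=$1
    shard=$2
    state=$3
    tmp=$(mktemp) || return 1
    awk -v shard="$shard" -v state="$state" '
        $1 == shard { $3 = state }   # rewrite only the matching entry
        { print }
    ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Copy the contents back so the original file keeps its permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

For example, `set_shard_state repolist shard12 active` would switch shard12 to active, after which the file still needs to be committed as described above.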

Trimming unavailable files

Sometimes a shard will get almost all files backed up to enough clients, but a few files will not get backed up at all. This can happen if the IA darks an item or deletes a file after the survey, so that it's no longer available to download. This makes the stats look bad and wastes client time trying again and again to download the files.

One way to deal with this is to go into the git repository for the shard and delete the files that are not available from the IA. Commit and push it back, and done.

Thing is, the IABak system does not allow such changes to shard git repos to be pushed, because we don't want users messing with the shards. So, this has to be done on the server. There are checkouts of all the shard git repos under /srv/shard/ and changes can be made in there. IIRC, I have temporarily moved /home/SHARDn/shardn.git/hooks/update out of the way to allow git push of that change to work, of course putting it back afterwards. There is probably a better way.
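
The "move the hook aside, push, put it back" dance is easy to get wrong by forgetting the last step, so it may be worth wrapping. The following is a rough sketch of that idea, not something that exists on the server: with_hook_disabled is a hypothetical name, and the ".disabled" suffix is an arbitrary choice. It restores the hook whether or not the command succeeds, but it does not guard against being interrupted partway through.

```shell
# Hypothetical helper: run a command with a git hook temporarily moved aside.
# $1 = path to the hook file; remaining arguments = the command to run.
with_hook_disabled() {
    hook=$1
    shift
    mv "$hook" "$hook.disabled" || return 1
    "$@"
    status=$?
    mv "$hook.disabled" "$hook"   # put the hook back even if the command failed
    return "$status"
}
```

Run from the shard's checkout under /srv/shard/, usage might look like `with_hook_disabled /home/SHARDn/shardn.git/hooks/update git push`, with SHARDn replaced by the real shard name.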

(Something should be done to handle this automatically.)