ArchiveTeam Warrior
What is the Archive Team Warrior?
The Archive Team Warrior is a virtual archiving appliance. You can run it to help with the ArchiveTeam archiving efforts. It will download sites and upload them to our archive — and it’s really easy to do!
The warrior is a virtual machine, so there is no risk to your computer. The warrior will only use your bandwidth and some of your disk space. It will get tasks from and report progress to the Tracker.
Basic usage
The warrior runs on Windows, OS X and Linux. You’ll need VirtualBox (recommended), VMware Workstation/Player, or a similar program to run the virtual machine.
Instructions for VirtualBox:
- Download the appliance (174MB).
- In VirtualBox, click File > Import Appliance and open the file.
- Start the virtual machine. It will fetch the latest updates and will eventually tell you to start your web browser.
Once you’ve started your warrior:
- Go to http://localhost:8001/ and check the Settings page.
- Choose a username — we’ll show your progress on the leaderboard.
- Go to the All projects tab and pick a project to work on. Even better: select ArchiveTeam’s Choice to let your warrior work on the most urgent project.
Warrior FAQ
Why am I seeing a message that no item was received?
It means that there is no work available. This can happen for several reasons:
- The project has just finished and someone is inspecting the work done. If a problem is discovered, items may be re-queued and more work will become available.
- In rare cases, you have been banned by a tracker administrator because you were requesting too much work or your internet connection is "unclean". We prefer connections from many public IP addresses, use of non-captive DNS servers, and no proxies/firewalls.
Why am I seeing a message about rate limiting?
Keep in mind that although downloading the internet for digital preservation and fun is the primary goal of all Archive Team activities, the downloads can put serious stress on the target's servers. The rate limit is imposed by a tracker administrator and should not be subverted.
Help! The warrior is eating all my bandwidth!
You can limit the warrior's bandwidth quite easily in VirtualBox, as long as you are running a relatively recent version. However, the option is not available in the GUI.
The command
VBoxManage bandwidthctl archiveteam-warrior-2 --name Limit --add network --limit 3
will limit the warrior instance called archiveteam-warrior-2 (currently the default name of the warrior VM) to 3Mb/s. Adjust as needed.
In the latest version of VirtualBox on Windows, the syntax appears to have changed. The correct command now seems to be:
VBoxManage bandwidthctl archiveteam-warrior-2 add netlimit --type network --limit 3
For more info, consult the VirtualBox manual (Chapter 6, Section 9).
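Since the two command forms differ only in argument order and flag names, a small helper can assemble either one. The following is a minimal sketch in Python, not part of the warrior itself: the function name bandwidthctl_args and its parameters are made up for illustration, and it reproduces only the two command lines quoted above without running anything.

```python
def bandwidthctl_args(vm_name, limit, group, newer_syntax=True):
    """Build the VBoxManage argument list for a network bandwidth limit.

    Hypothetical helper: it only reproduces the two command forms quoted
    in this FAQ and does not execute VBoxManage.
    """
    if newer_syntax:
        # Form reported for recent VirtualBox versions on Windows.
        return ["VBoxManage", "bandwidthctl", vm_name, "add", group,
                "--type", "network", "--limit", str(limit)]
    # Older form quoted first in this FAQ.
    return ["VBoxManage", "bandwidthctl", vm_name, "--name", group,
            "--add", "network", "--limit", str(limit)]


if __name__ == "__main__":
    # Print both variants so the right one can be copied into a terminal.
    print(" ".join(bandwidthctl_args("archiveteam-warrior-2", 3, "Limit",
                                     newer_syntax=False)))
    print(" ".join(bandwidthctl_args("archiveteam-warrior-2", 3, "netlimit")))
```

To actually apply the limit, the returned list could be passed to subprocess.call, assuming VBoxManage is on your PATH.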
I turned my warrior off, will those tasks be lost?
If you've killed your warrior instance, the work it did has been lost. However, the tasks will be returned to the pool after a period of time. If you want, you can alert the admins via IRC about what happened, and they can clear the claims your username may have made, but this isn't very important on most projects.
I need to disconnect my internet / reboot my PC but I don't want to lose work
If you pause/suspend the warrior instance, most projects will allow resuming of work in progress when you unsuspend the warrior instance.
I told the warrior to shut down from the interface but nothing has changed! What gives?
The warrior will attempt to finish the currently running tasks before shutting down. If you need to shut down right away, go ahead: your progress will be lost, but the jobs will eventually cycle out to another user.
How much disk space will the warrior use?
Short answer: it depends on the project.
Long answer: because each project defines an item differently, the warrior may be downloading anything from a small file to a whole subsection of a website. The virtual machine is configured by default to use 60GB as an absolute maximum. Any unused virtual machine disk space is not used on the host computer. You may, however, run the virtual machine on less than 60GB if you like to live dangerously. We're downloading the internet after all!
The secondary disk is using up space even though it's not running a project.
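Virtual machine disk images do not behave like regular files: a dynamically allocated image grows as data is written, but it does not shrink on its own when files are deleted inside the guest. There are several ways to reclaim space: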
- Delete the second disk and put back an empty disk. The warrior should reformat the second disk.
- Delete the entire warrior appliance and re-import it.
- Use the zerofree program and then clone the disk image. Reattach the cloned disk image.
I can't connect to localhost?
The appliance includes a configuration that sets up port forwarding to the guest machine on port 8001 so you can access the interface through your web browser. If this does not happen, you may need to double check the virtual machine's network settings.
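Before digging into the network settings, it can help to confirm whether anything is listening on the forwarded port at all. Here is a minimal Python sketch, run on the host; the function name port_is_open is made up for illustration and only the port number 8001 comes from this page.

```python
import socket


def port_is_open(host="localhost", port=8001, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, or unreachable: nothing usable is listening.
        return False
    conn.close()
    return True
```

If port_is_open() returns False while the virtual machine is running, the port forwarding rule is the first thing to check. A True result only shows that something accepts connections on the port, not that it is the warrior.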
The warrior can't connect to the internet?
The virtual machine may have picked up the address of the local DNS cache on your computer, which the virtual machine does not have access to.
If you experience this on VirtualBox, see this question and answer.
I'm looking at the text scrolling by and I notice some errors? Rsync is not working?
Uh-oh! Something is not right. Notify us immediately in the appropriate IRC channel.
I'm looking at the leaderboard. What's that icon beside the username?
That's just the warrior logo. It means that person is using the warrior. Those without the icon are running the scripts manually.
What's that guy doing in the logo?
The place is on fire! But don't worry, he safely escaped with the rescued data in his arms.
I want to log in to the virtual machine. How do I do this?
Unless you know what you are doing, you should not need to do this. But if you want to, the username is root and the password is archiveteam.
Press ALT+F3 to switch to virtual console number 3. Use ALT+Left or ALT+Right to switch between virtual consoles. There are 6 virtual consoles in total. Numbers 1 and 2 are reserved for the warrior.
The warrior seems to have too much overhead. I can't run a VM in a VPS!
You don't need to run a virtual machine. If you are managing a VPS, it's likely you are comfortable with some Linux stuff. Projects can be run manually. Consult the project wiki page or the source code repository readme file.
Why a virtual machine in the first place?
The virtual machine is a quick, safe, and easy way for newcomers to help us out. It offers many features:
- Graphical interface
- Automatically selects which project is important to run
- Self-updating software infrastructure
- Allows for unattended use
- In case of software faults, your machine is not ruined
- Restarts itself in case of runaway programs
- Runs on Windows, Mac OS, Linux painlessly
If you have suggestions for improving this system, please talk to us as described below.
Have you considered using Docker?
Docker was discussed a few times. You are more than welcome to take initiative to put the warrior infrastructure into a Docker instance.
I just imported the ova image and the warrior is stuck on "Preparing the data partition"
This issue has cropped up before and we do not know what causes it. It is recommended to just delete the warrior image and import the ova again. Testing shows the import works the majority of the time.
Why is the default project not working? / Why is a manual project not in the Warrior yet?
Sorry. Sometimes the administrators are too busy...
Where can I file a bug or a feature request?
If the issue is related to the warrior's web interface or the library that grab scripts are using, see seesaw-kit issues. Other issues should be filed into their own repositories as described later in this page.
I still have a question!
Talk to us on IRC. Use #warrior for specific warrior questions or #archiveteam for general questions.
Projects
Previous and current warrior projects:
Project | Status | Began | Finished | Result | Archive Location |
---|---|---|---|---|---|
MobileMe | Archive Posted | April 3, 2012 | Aug 8, 2012 | Success | |
FortuneCity | Archive Posted | April 4, 2012 | April 11, 2012 | Partial Success | archive user lookup |
Tabblo | Archive Posted | May 23, 2012 | May 26, 2012 | Success | archive user lookup |
Picplz | Archive Posted | June 3, 2012 | June 15, 2012 | | archive index user lookup |
Tumblr (test project) | Archive Posted | August 9, 2012 | August 19, 2012 | | archive (tar) archive (warc) |
Cinch.FM | Archive Posted | August 20, 2012 | August 22, 2012 | Success | archive |
City of Heroes | Archive Posted | September 3, 2012 | December 1, 2012 | Success | www forums 1 2 3 4 5 |
Webshots | Archive Posted | October 4, 2012 | November 18, 2012 | | index |
BT Internet | Archive Posted | October 10, 2012 | November 2, 2012 | Success | archive |
Daily Booth | Archive Posted | November 19, 2012 | December 29, 2012 | | archive lookup |
GitHub Downloads | Archive Posted | December 13, 2012 | December 17, 2012 | Success | archive index |
Yahoo! Blog | Archive Posted | January 8, 2013 | January 19, 2013 | | archive |
weblog.nl | Archive Posted | January 19, 2013 | February 2, 2013 | | archive lookup |
URLTeam | Active | | | | all releases |
Punchfork | Archive Posted | January 11, 2013 | March 6, 2013 | | archive user lookup |
Xanga | Downloads Paused | January 22, 2013 | February 16, 2013 | | archive user lookup user list |
Posterous | Downloads Finished | February 23, 2013 | June 29, 2013 | | archive |
Storylane | Downloads Finished | March 8, 2013 | March 15, 2013 | | |
Yahoo! Messages | Downloads Finished | March 20, 2013 | March 31, 2013 | | archive |
Formspring | Downloads Finished | March 24, 2013 | September 19, 2013 | Success | archive |
Yahoo Upcoming | Archive Posted | April 20, 2013 | April 25, 2013 | | archive |
Streetfiles.org | Downloads Finished | April 28, 2013 | April 30, 2013 | Partial | archive |
Xanga | Downloads Paused | June 21, 2013 | August 31, 2013 | | archive |
Zapd | Archive Posted | October 1, 2013 | October 8, 2013 | Success | archive |
Blip.tv | Active | October 11, 2013 | | | |
Status
- In Development: a future project
- Active: start up a Warrior and join the fun; this one is in progress right now
- Downloads Finished: we've finished downloading the data
- Archived: the collected data has been properly archived
- Archive Posted: the archive is available for download
Result
- Success: downloaded all of the data and posted the archive publicly
- Qualified Success: either we couldn't get all of the data, or the archive can't be made public
- Failure: the site closed before we could download anything
Testing pre-production code
(Don't do this unless you really need or want to.) If you are developing a warrior script, you can test it by switching your warrior from the master branch to the development branch, or to another branch that you create.
- Start the warrior.
- Press Alt+F2 and log in as root, password archiveteam.
- cd /home/warrior/warrior-code2
- sudo -u warrior git checkout development
- reboot
By the same route you can return your warrior to the master branch.
How the warrior works
The warrior image is built off Debian 6.0.5 (squeeze). Here are the basics:
- kernel 2.6.32-5-686 (upstream 2.6.32 was released 2009-12-03)
- Python 2.6.6, pip 1.1
- Perl v5.10.1, cpan 1.9402 (still needs config)
- gcc 4.4.5, make 3.81, bash 4.1.5
- nano 2.2.4 with color syntax highlighting
- curl 7.21.0
Repositories
The warrior uses the following repos from our GitHub organization:
Client code
Client code includes code that the warrior executes.
- For constructing the virtual appliance image
- Bootstrap code that is pulled from GitHub by the appliance
- Library that helps build grab scripts and the web interface for the warrior. The name "seesaw" comes from its original behavior: download, upload, and repeat.
Projects are in separate repositories, typically with -grab as a suffix of the name.
Server code
Server code includes code that the Tracker executes.
- The server that seesaw contacts
- The server that the warrior appliances contact for project metadata
- The scripts that bundle the WARC files.
URLTeam code
URLTeam code is independent from the tracker and warrior.
- The client code that scrapes the shortlinks. It includes a pipeline shim to run the code.
- The server code for the tracker.
Bootup
- Start the virtual machine
- Linux boots
- The user warrior is automatically logged in. /etc/inittab kicks off /home/warrior/warrior-code2/boot.sh.
  - This will git pull https://github.com/ArchiveTeam/warrior-code2 into /home/warrior/warrior-code2/.
  - /home/warrior/warrior-code2/warrior-runner.sh sets up a process which monitors /dev/shm/ready-for-warrior and launches run-warrior when the state changes.
- boot.sh launches /home/warrior/warrior-code/boot-part-2.sh
- boot-part-2.sh is a short script that does the following:
  - ./warrior-install.sh
    - installs/updates seesaw, checks branch and version
    - installs framebuffer support, DNS caching
    - sets up /data
  - sudo ./make-data-disk.sh
    - cleans up
    - creates and prepares the partition
  - mkdir -p /home/warrior/projects
  - touch /dev/shm/ready-for-warrior
    - triggers the launch of /usr/local/bin/run-warrior which launches /home/warrior/warrior-code2/src/seesaw/run-warrior
    - contacts warriorhq.archiveteam.org and requests the projects.json file. This file contains the projects you see in the Available Projects page.
  - triggers the launch of ./say-hello.sh
    - sets up VMware port forwarding
    - shows splash screen
- Point your web browser to http://localhost:8001 and go.
The code for each project is stored in /home/warrior/projects/<PROJECTNAME>/
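The hand-off between the boot scripts relies on a flag file: one process waits on /dev/shm/ready-for-warrior and another creates it with touch once setup is done. The sketch below illustrates that pattern in Python. It is not the actual warrior-runner.sh, which is a shell script; the function names are made up for illustration, and it assumes that the "state change" being watched is simply the file coming into existence.

```python
import os
import subprocess
import time


def wait_for_flag(flag_path, poll_seconds=1.0, max_polls=None):
    """Poll until flag_path exists. Return False if max_polls runs out first."""
    polls = 0
    while not os.path.exists(flag_path):
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        time.sleep(poll_seconds)
    return True


def run_when_ready(flag_path, command, **wait_kwargs):
    """Run command once the flag file appears and return its exit code.

    Returns None if the flag never showed up within the polling budget.
    """
    if not wait_for_flag(flag_path, **wait_kwargs):
        return None
    return subprocess.call(command)
```

With the paths from this page, a call would look like run_when_ready("/dev/shm/ready-for-warrior", ["/usr/local/bin/run-warrior"]). Using a file in /dev/shm means the flag lives in memory and disappears on reboot, so each boot starts in the not-ready state.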