Difference between revisions of "ArchiveTeam Warrior"

From Archiveteam
(Undo revision 18264 by Chfoo (talk) it's still going via caches)
(→‎Basic usage: add Podman info)
(188 intermediate revisions by 35 users not shown)
==What is the Archive Team Warrior?==
{{notice|1=The current versions of the Warrior Docker image and the Warrior virtual machine image should now be compatible with most projects; however some projects may still not be compatible and show a blank screen when attempting to run them. As an alternative, you can [[Running_Archive_Team_Projects_with_Docker|run individual projects manually using Docker]].
 
If you have any issues or feedback, see the [[Archiveteam:IRC|AT #warrior IRC channel on hackint]].}}
 
== What is the Archive Team Warrior? ==


[[Image:Archive_team.png|100px|left]]
[[Image:Warrior-vm-screenshot.png|right]]
[[Image:Warrior-vm-screenshot.png|256px|right]]
[[Image:Warrior-web-screenshot.png|right]]
[[Image:Warrior-web-screenshot.png|256px|right]]
[[File:Archiveteam_warrior_infrastructure.png|thumb|right|256px|[[Dev/Infrastructure|Warrior infrastructure]]]]
 
The Archive Team Warrior is a virtual archiving appliance. You can run it to help with the Archive Team archiving efforts. It will download sites and upload them to our archive—and it’s really easy to do!
 
The warrior is a container running inside a virtual machine, so there is almost no security risk to your computer. "Almost", because in practice nothing is 100% secure. The warrior will only use your bandwidth and some of your disk space, as well as some of your CPU and memory. It will get tasks from and report progress to the [[Tracker]].
 
== Basic usage ==
 
The Warrior runs on Windows, macOS, and Linux. You can run it using a virtual machine (simplest) or using Docker (slightly more complicated, but much less overhead than the VM).
 
=== Installing and running with a virtual machine ===
 
You'll need:
* The Warrior Appliance (size: 123MB, current version: 3.2), from one of the following locations:
** [https://github.com/ArchiveTeam/Ubuntu-Warrior/releases GitHub]
** [https://warriorhq.archiveteam.org/downloads/warrior3/ Archive Team]
** [https://www.syping.de/archiveteam/ Syping Development (DE)]
** [https://archive.org/details/archiveteam-warrior-v3-20171013 Internet Archive] (outdated but still functional)
* A virtualization application to run it, such as:
** [https://www.virtualbox.org/ VirtualBox] (recommended, open source)
** [https://www.vmware.com/products/player/ VMware Player] (may have some compatibility issues, free-gratis for personal use)
 
==== VirtualBox ====
 
# Download the appliance from the link above.
# Launch VirtualBox.
# In VirtualBox, click <code>File > Import Appliance</code> and open the file.
# Start the virtual machine.
#* It will fetch the latest updates and will eventually tell you to start your web browser.
# Using your regular web browser, visit http://localhost:8001/.
 
A [https://www.youtube.com/watch?v=_nzD-QpmePE video demonstrating these steps] is available. (Note that the screen indicating that the Warrior has finished loading looks different than the one from when this video was made, but the steps are otherwise the same.)


The Archive Team Warrior is a virtual archiving appliance. You can run it to help with the ArchiveTeam archiving efforts. It will download sites and upload them to our archive — and it’s really easy to do!
==== VMware Player ====
Note that VMware Player may have some compatibility issues with running the Warrior image.


The warrior is a virtual machine, so there is no risk to your computer. The warrior will only use your bandwidth and some of your disk space. It will get tasks from and report progress to the [[Tracker]].
# Download the appliance from the link above.
# Launch VMware Player.
# In Player on the right, click "Open Virtual Machine", open the file and import the virtual machine.
# (Optional) Select the virtual machine and click "Edit virtual machine settings".
#* Select Network Adapter and set it to "Bridged: Connected directly to the physical network"
# Start the virtual machine.
#* It will fetch the latest updates and will eventually tell you to start your web browser.
# Using your regular web browser, visit the address that is shown on the bottom (e.g. http://192.168.0.100:8001/)


==Basic usage==
=== Installing and running with Docker ===


The warrior runs on Windows, OS X and Linux. You’ll need [https://www.virtualbox.org/ VirtualBox] (recommended), VMware workstation/player, or a similar program to run the virtual machine.
You'll need [https://docs.docker.com/get-docker/ Docker] (open source) and the Warrior Docker image.


Instructions for VirtualBox:
# Download Docker from the link above and install it.
<ol>
# Open your terminal. On Windows, you can use either Command Prompt (CMD) or PowerShell. On macOS and Linux you can use Terminal (Bash).
  <li>Download the [http://archive.org/download/archiveteam-warrior/archiveteam-warrior-v2-20121008.ova appliance] (174MB).</li>
# Use the following command to start the Warrior as well as Watchtower, which will automatically keep your Warrior updated: <pre>docker run --detach --name watchtower --restart=on-failure --volume /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --label-enable --cleanup --interval 3600 && docker run --detach --name archiveteam-warrior --label=com.centurylinklabs.watchtower.enable=true --restart=on-failure --publish 8001:8001 atdr.meo.ws/archiveteam/warrior-dockerfile</pre> Note that the current version of this command may not persist Warrior configuration (username, selected project, and item concurrency) across container/dependency updates. These types of updates typically only occur once every few months and are far less frequent than normal script updates, which happen inside the container without affecting the container configuration. (For a full explanation of this command, see items 3 and 4  [[Running_Archive_Team_Projects_with_Docker#Instructions_for_using_Docker_CLI_on_Windows.2C_macOS.2C_or_Linux|here]].)<br/>You may wish to protect the web configuration interface for your Warrior by setting a username and password for the web interface and by adding a rule to your firewall (such as [https://github.com/chaifeng/ufw-docker#solving-ufw-and-docker-issues ufw]).
  <li>In VirtualBox, click File > Import Appliance and open the file.</li>
# Using your regular web browser, visit http://localhost:8001/.
  <li>Start the virtual machine. It will fetch the latest updates and will eventually tell you to start your web browser.</li>
</ol>


Once you’ve started your warrior:
If you prefer '''Podman''' over Docker, [[User:Sanqui]] has had success running the Warrior with Podman using
<ol>
<code>podman run --detach --name at-warrior --label=io.containers.autoupdate  --restart=on-failure --publish 8001:8001 atdr.meo.ws/archiveteam/warrior-dockerfile</code> and [https://docs.podman.io/en/latest/markdown/podman-auto-update.1.html podman-auto-update] in place of Watchtower.
  <li>Go to http://localhost:8001/ and check the Settings page.</li>
  <li>Choose a username — we’ll show your progress on the leaderboard.</li>
  <li>Go to the All projects tab and pick a project to work on. Even better: select ArchiveTeam’s Choice to let your warrior work on the most urgent project.</li>
</ol>


==Warrior FAQ==
__TOC__
=== Why am I seeing a message that no item was received? ===


It means that there is no work available. This can happen for several reasons:
== Warrior FAQ ==
 
=== Why a virtual machine/container in the first place? ===
 
The Warrior is a quick, safe, and easy way for newcomers to help us out. It offers many features:
 
* Graphical interface (virtual machine only)
* Automatically selects which project is important to run
* Self-updating software infrastructure
* Allows for unattended use
* In case of software faults, your machine is not ruined
* Restarts itself in case of runaway programs
* Runs on Windows, Mac, and Linux painlessly
* Ensures consistency in the archived data regardless of your machine's quirks
* Can be configured to restart automatically after a system restart ([[#How can I set up the Warrior to start up on boot and shut down automatically?|see below]]).
 
If you have suggestions for improving this system, [[#I_still_have_a_question.21|talk to us]].
 
=== Can I use whatever internet access for the Warrior? ===
 
No. We need "clean" connections. Please ensure the following:
 
* No OpenDNS. No ISP DNS that redirects to a search page. Use non-captive DNS servers.
* No ISP connections that inject advertisements into web pages.
* No proxies. Proxies can return bad data. The original HTTP headers and IP address are needed for the WARC file.
* No content-filtering firewalls.
* No censorship. If you believe your country implements censorship, do not run a warrior.
* No Tor. The server may return an error page instead of content if they ban exit nodes.
* No free cafe wifi. Archiving your cafe's wifi service agreement repeatedly is not helpful.
* No VPNs. Data integrity is a very high priority for the Archive Team so use of VPNs with the official crawler is discouraged.
* We prefer connections from many public IP addresses if possible. (For example, if your apartment building uses a single IP address, we don't want your apartment banned.)
 
=== I turned my Warrior off. Will those tasks be lost? ===
 
If you've killed your Warrior, then the work it was doing has been lost. However, the tasks will be returned to the pool after a period of time, and other warriors may claim them.
 
=== I closed my browser or tab with the Warrior's web interface. Will those tasks be lost? ===
 
No. The web browser interface just provides a user interface to the Warrior. As long as the VM or Docker container is not stopped, it will continue normally.
 
=== How can I shut down the Warrior without losing work? ===
 
==== Recommended method ====
 
Click the "Shut down" button on the left of the web interface. All the current tasks will still finish, but no new ones will be started. When a banner appears saying "There is no connection with the warrior", the Warrior has finished shutting down. (If you would rather use the command line than the web interface, [[#How_can_I_run_the_Warrior_headlessly_.28without_leaving_a_window_open.29.3F|see below]]).
 
==== Suspend/resume with the virtual machine ====
 
If you don't want to wait (perhaps because a task is long-running), you can use VirtualBox's <code>Machine > Pause</code> or VMware's <code>VM > Pause</code> to suspend the Warrior VM, then resume it when you are ready to work again. Note that if you keep it suspended for too long (more than a few hours), the tracker will assume that the item is lost and re-queue it—but suspending in order to reboot your computer or reset your internet connection should be perfectly fine.
 
==== Suspend/resume with the Docker container ====
 
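If you don't want to wait, your options are more limited than with a virtual machine. <code>docker pause</code> and <code>docker unpause</code> freeze and resume a running container's processes, but a paused container does not survive a reboot of the host, so this only helps with short interruptions such as resetting your internet connection.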
 
=== How much disk space will the Warrior use? ===
 
Short answer: it depends on the project. The virtual machine has a hard limit of 60GB disk usage, but the Docker container does not have such a limit. However, it is highly unlikely that any project would use more than 60GB of disk space at any time.
 
Long answer: because each project defines items differently, sizes may vary. A single task may be a small file or a whole subsection of a website. The virtual machine is configured by default to use an absolute maximum of 60GB, but Docker has no hard limit. Any unused virtual machine or Docker container disk space is not used on the host computer. You may configure the virtual machine to run on less than 60GB if you like to live dangerously. We're downloading the internet, after all!
 
=== How can I log into the Warrior? ===
 
Unless you know what you are doing, you should not need to do this.
 
==== Virtual machine ====
 
With the Warrior running, press ALT+F4 to switch to virtual console number 4. VirtualBox users may need to press the host key, RIGHT_CONTROL, to enter capture mode before pressing ALT+F4. Use ALT+Left or ALT+Right to switch between virtual consoles. There are 6 virtual consoles in total. Consoles 1, 2, and 3 are reserved for the warrior. Switching to a new virtual console will show a login shell. You can log in using the username <code>root</code> and the password <code>archiveteam</code>.
 
==== Docker container ====
 
With the Warrior running, open your terminal and run <code>sudo docker exec -t -i archiveteam-warrior /bin/bash</code>. Replace 'archiveteam-warrior' with the name of your Warrior container if necessary.
 
=== How can I run multiple Warriors at the same time? ===
 
This usually isn't necessary; if you want to increase your work on a project, you can increase the number of items your Warrior will work on at the same time. In the web interface, go to the "Your settings" tab, tick the "Show advanced settings" box, and edit the "Concurrent items" field. The maximum concurrency is 6.


* The project has just finished and someone is inspecting the work done. If a problem is discovered, items may be re-queued and more work will become available.
==== Virtual machines ====
* In rare cases, you have been banned by a tracker administrator because you were requesting too much work or your internet connection is "unclean". We prefer connections from many public IP addresses, use of non-captive DNS servers, and no proxies/firewalls.


=== Why am I seeing a message about rate limiting? ===
You'll need to adjust the networking settings.


Keep in mind that although downloading the internet for digital preservation and fun are the primary goals of all Archive Team activities, serious stress on the target's server may occur. The rate limit is imposed by a [[Tracker#People|tracker administrator]] and should not be subverted.
In VirtualBox, select a virtual machine and open up <code>Settings > Network > Adapter 1 > Port Forwarding</code>. You need to adjust the host port. For example, setting your table to <code>TCP | 127.0.0.1 | 8123 | | 8001</code> will map port 8123 on the host machine (your computer) to port 8001 on the virtual machine (the warrior), and you can then access the warrior's web interface from port 8123 in your browser.


===Help! The warrior is eating all my bandwidth!===
VMware installations should be using bridged networking. However, if you want, you can switch to NAT (under <code>Settings > Hardware > Virtual Network Adapter</code>) and click Edit to set up port forwarding. On Linux, you can also use lines like <code>8123 = 192.168.0.100:8001</code> in the <code>[incomingtcp]</code> section of nat.conf. (Make sure the VM IP is correct!)


You can limit the Warrior's bandwidth quite easily in VirtualBox, as long as you are running a relatively recent version. However, the option is not available in the GUI.
Each VM you want to access should have a different host port. Do not use port numbers below 1024 unless you know what you are doing.
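
The port-forwarding rule described above can also be added from the command line. The following is a minimal sketch, not an official procedure: it only prints a <code>VBoxManage modifyvm</code> command using the NAT port-forwarding option, and the function name and the rule name <code>warrior-web-…</code> are hypothetical. It assumes the VM is powered off when the printed command is run, and the option spelling may differ between VirtualBox releases, so check the <code>VBoxManage modifyvm</code> help for your version.

```shell
#!/bin/sh
# Print (do not run) a VBoxManage command that forwards a host port on
# 127.0.0.1 to port 8001 inside the given Warrior VM.
# Assumption: the rule name "warrior-web-<port>" is made up for illustration.
print_portforward_command() {
    vm=$1
    host_port=$2
    printf 'VBoxManage modifyvm "%s" --natpf1 "warrior-web-%s,tcp,127.0.0.1,%s,,8001"\n' \
        "$vm" "$host_port" "$host_port"
}

# Example: map host port 8123 to the web interface of the VM.
print_portforward_command archiveteam-warrior-3.2 8123
```

Printing the command first makes it easy to review the VM name and port before changing any settings.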


The command <pre>VBoxManage bandwidthctl archiveteam-warrior-2 --name Limit --add network --limit 3</pre> will limit the warrior instance called archiveteam-warrior-2 (The default name of the warrior vm currently) to 3Mb/s. Adjust as needed.
==== Docker containers ====


In the latest version of VirtualBox on Windows, the syntax appears to have changed. The correct command now seems to be:
You'll need to adjust the run command used to create the containers.
 
First, each container needs a unique name, so you will need to replace the name specified with the <code>--name</code> parameter with something unique.
 
Second, you will need to specify a unique port to access the web interface of each container. You can do this by changing the number before the <code>:</code> in the <code>--publish</code> parameter to any available unique port number equal to or greater than 1024. (Additional options for specifying ports are explained in the [https://docs.docker.com/network/links/#connect-using-network-port-mapping Docker documentation].)
 
You may also want to reuse your configuration between different Docker containers; you can do this by specifying the same [https://github.com/ArchiveTeam/warrior-dockerfile#using-environment-variables environment variables] or bindmounting the same <code>config.json</code> file across all of your containers. See the [https://github.com/ArchiveTeam/warrior-dockerfile#readme Warrior Dockerfile README] for more details about this.
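
As a rough illustration of the two changes above (a unique <code>--name</code> and a unique host port per container), the sketch below prints one <code>docker run</code> command per Warrior without executing anything. The function name, the numbered name suffix, and the choice of consecutive ports are assumptions made for this example; the flags and image name are the ones used in the installation instructions.

```shell
#!/bin/sh
# Print one "docker run" command per Warrior container.
# Nothing is executed; review the output, then run the lines you want.
# Assumptions: names archiveteam-warrior-1..N and consecutive host ports
# starting at the given base port are illustrative choices only.
print_warrior_commands() {
    count=$1
    base_port=${2:-8001}
    i=0
    while [ "$i" -lt "$count" ]; do
        port=$((base_port + i))
        printf 'docker run --detach --name archiveteam-warrior-%d --label=com.centurylinklabs.watchtower.enable=true --restart=on-failure --publish %d:8001 atdr.meo.ws/archiveteam/warrior-dockerfile\n' \
            "$((i + 1))" "$port"
        i=$((i + 1))
    done
}

# Example: three Warriors reachable on host ports 8001, 8002, and 8003.
print_warrior_commands 3 8001
```

Each container's web interface would then be reachable on its own host port, for example http://localhost:8002/ for the second one.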
 
=== How can I run the Warrior headlessly (without leaving a window open)? ===
 
==== Virtual machine ====
 
From the VirtualBox GUI, after opening the VM, click <code>Machine > Detach GUI</code>. You can then close the VirtualBox Manager window.
 
For the VirtualBox CLI, you can start up the VM with <code>VBoxManage startvm archiveteam-warrior-3.2 --type headless</code> and shut it down with <code>VBoxManage controlvm archiveteam-warrior-3.2 acpipowerbutton</code>. Substituting <code>suspend</code> or <code>resume</code> for <code>acpipowerbutton</code> suspends or resumes the VM. For more information, consult [http://www.virtualbox.org/manual/ch08.html#vboxmanage-startvm the VirtualBox manual (Chapter 8, Sections 12 and 13)].
 
For the VMware CLI, you can start up the VM with <code>vmrun start <path to vmx file> nogui</code> and shut it down with <code>vmrun stop <path to vmx file> soft</code>. Substituting <code>suspend</code> for <code>stop</code> suspends the VM; resume with <code>start</code> again. For more information, including the paths to VMX files on different operating systems, consult [http://www.vmware.com/pdf/vix180_vmrun_command.pdf Using vmrun to Control Virtual Machines] (PDF), pages 10 and 11.
 
==== Docker container ====
 
The container does not have a GUI, and if run with <code>--detach</code> (as the instructions suggest), it will not occupy your terminal window either. It is therefore headless by default. You can start up the container with <code>docker start archiveteam-warrior</code> and shut it down with <code>docker kill --signal=SIGINT archiveteam-warrior</code>.
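
If you script the shutdown, it can be useful to wait until the container has actually exited before, say, rebooting the host. The sketch below sends the same <code>SIGINT</code> as the command above and then blocks with <code>docker wait</code> until the container stops. The function name and the overridable <code>DOCKER</code> variable are assumptions added for this example (the variable allows a dry run with <code>DOCKER=echo</code>).

```shell
#!/bin/sh
# Ask a Warrior container to shut down, then wait for it to exit.
# Assumption: DOCKER is a hypothetical override so the commands can be
# previewed with DOCKER=echo instead of being run.
DOCKER=${DOCKER:-docker}

stop_warrior() {
    name=${1:-archiveteam-warrior}
    # Send SIGINT (the shutdown signal used in the instructions above) ...
    "$DOCKER" kill --signal=SIGINT "$name" &&
    # ... then block until the container has stopped.
    "$DOCKER" wait "$name"
}

# Usage (requires Docker and a running container):
#   stop_warrior archiveteam-warrior
```

Waiting may take a while if the current items are large, since the container only exits once it has finished shutting down.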
 
=== How can I set up the Warrior to start up on boot and shut down automatically? ===
 
==== Virtual machine ====
 
If you are using VirtualBox and running a Linux distribution that uses the systemd init system (like most recent releases), you can set the VM up as a system service by following the short instructions on [http://www.ericerfanian.com/automatically-starting-virtualbox-vms-on-archlinux-using-systemd/ this page]. (The page title specifies Arch Linux, but this will work for other distros as long as they run systemd.)


<pre>VBoxManage bandwidthctl archiveteam-warrior-2 add netlimit --type network --limit 3</pre>
==== Docker container ====


For more info, consult the [http://www.virtualbox.org/manual/ch06.html#network_bandwidth_limit VirtualBox manual (Chapter 6, Section 9)].
If the container is run with <code>--restart=on-failure</code> (as the instructions suggest), Docker will automatically start it on boot.


===Nat sucks! I want directly bridged networking!===
=== How can I set up the virtual machine with directly-bridged networking instead of NAT? ===


Simples! (If you're running linux that is)
On VirtualBox, use these commands:


<pre>VBoxManage modifyvm "archiveteam-warrior-2" --nic1 bridged</pre>
<pre>VBoxManage modifyvm archiveteam-warrior-3.2 --nic1 bridged
VBoxManage modifyvm archiveteam-warrior-3.2 --bridgeadapter1 eth0</pre>


<pre>VBoxManage modifyvm "archiveteam-warrior-2" --bridgeadapter1 eth0</pre>
We presume you want to bind to <code>eth0</code>. Adjust as required. :)


(We presume you want to bind to eth0, adjust as required :))
VMware installations should already be using bridged networking.


=== I turned my warrior off, will those tasks be lost? ===
=== How can I access the virtual machine from another device on my network? ===


If you've killed your warrior instances, then the work your warrior did has been lost; however, the tasks will be returned to the pool after a period of time. If you want, you can alert the admins via IRC of what has happened, and they can clear the claims your username may have made, but this isn't very important on most projects.
A full guide for VirtualBox users can be found [https://gist.github.com/HeliosLHC/cf3264c8d65b4680474ac13bcc6d0384 here].


=== I need to disconnect my internet / reboot my PC but I don't want to lose work ===
=== What's new in version 3.2 of the Warrior virtual machine? ===


If you pause/suspend the warrior instance, most projects will allow resuming of work in progress when you unsuspend the warrior instance.
This update enables running newer projects, shortens startup times, enables viewing basic logs from the virtual machine console (press ALT+F2 for Warrior logs, press ALT+F3 for automatic updater logs, and press ALT+F1 to return to the splash screen), and has other minor improvements. Warrior versions 3.0 and 3.1 will automatically update themselves with the project compatibility improvements, but the other improvements require re-creating the VM with version 3.2 of the appliance.


=== I told the warrior to shutdown from the interface but nothing has changed! what gives? ===
=== Are previous versions of the Warrior still supported? ===


The warrior will attempt to finish the currently running tasks before shutting down. If you need to shut down right away, go ahead: your progress will be lost, but the jobs will eventually cycle out to another user.
==== Virtual machine ====


=== How much disk space will the warrior use? ===
Currently, versions 3.0, 3.1, and 3.2 of the Warrior virtual machine are functional and supported, and are capable of automatically retrieving updated components as needed. Support for version 2 and prior of the Warrior virtual machine was discontinued around 2018 due to outdated SSL support.


Short answer: it depends on the project.
==== Docker container ====


Long answer: because each project defines an item differently, the warrior may be downloading anything from a small file to a whole subsection of a website. The virtual machine is configured by default to use 60GB as an absolute maximum. Any unused virtual machine disk space is not used on the host computer. You may, however, run the virtual machine on less than 60GB if you like to live dangerously. We're downloading the internet, after all!
We always recommend using the latest version of the Warrior Docker image, as new and updated projects often require the updated components provided by newer Docker images. If you run the Docker container with Watchtower (as the instructions suggest), your Docker container will automatically be kept up-to-date.


=== The secondary disk is using up space even though it's not running a project. ===
=== How can I run tons of Warriors easily? ===


Virtual machine disk images do not behave like a regular file. There are several ways to reclaim space:
We assume you've checked with the current Archive Team project leads what concurrency and resources are needed or useful!


* Delete the second disk and put back an empty disk. The warrior should reformat the second disk.
Whether you have your own virtual cluster or you're renting someone else's (aka a "[https://fsfe.org/activities/nocloud/ cloud]"), you probably need some [[wikipedia:Category:Orchestration_software|orchestration software]].
* Delete the entire warrior application and re-import it.
* Use the zerofree program and then clone the disk image. Reattach the cloned disk image.


=== I can't connect to localhost? ===
Archive Team volunteers have successfully used a variety of hosting providers and tools (including free trials on AWS and GCE), often just by building their own flavor of virtual server and then repeating it with simple [https://cloudinit.readthedocs.io/ cloud-init] scripts or whatever tool the hosting provider offers. If you desire full automation, the [https://gitlab.com/diggan/archiveteam-infra archiveteam-infra repository by diggan] helps with [[wikipedia:Terraform (software)|Terraform]] on [[wikipedia:DigitalOcean|DigitalOcean]].


The application includes a configuration to set up port forwarding to the guest machine on port 8001 so you can access the interface through your web browser. If this does not happen, you may need to double check your machine's network settings.
Some custom monitoring scripts also exist, for instance [https://github.com/general-programming/gp-archiveteam-bs/blob/master/tumblr/watcher.py watcher.py].


=== The warrior can't connect to the internet? ===
The instructions for [[#How_can_I_run_multiple_virtual_machines.2FDocker_containers_at_the_same_time.3F|running multiple Warriors on one machine]] may be helpful. However, you should also consider [[Running_Archive_Team_Projects_with_Docker|running Docker containers for individual projects]] rather than the Warrior; these have even less overhead and can be configured with greater concurrency.


It may be possible that the virtual machine has picked up the address of the local DNS cache on your computer which the virtual machine does not have access to.
=== What are the alternatives to using the Warrior? ===


If you experience this on VirtualBox, see [http://askubuntu.com/questions/204953/virtualbox-dns-stopped-working-on-upgrade-to-12-10 this question and answer].
One is [[Running_Archive_Team_Projects_with_Docker|running Docker containers for individual projects]]. This is particularly useful if you want to deploy a large amount of computing power.


=== I'm looking at the text scrolling by and I notice some errors? Rsync is not working? ===
Another is running individual projects directly. Check the source repository for the project you're interested in and follow the instructions for running without a Warrior in the README. This is particularly useful if you want to deploy on a machine where you don't have root.


Uh-oh! Something is not right. Notify us immediately in the appropriate [[IRC]] channel.
We generally recommend that people use the Warrior, because it is simple for non-technical users to set up and it requires no supervision. You should only use these alternatives if you are comfortable with Linux and prepared to manually intervene when projects begin and end.


=== I'm looking at the leaderboard. What's that icon beside the username? ===


That's just the warrior logo: [[File:Archive_team.png|42px]] (click on the image for a larger version). It means that person is using the warrior. Those without the icon are running the scripts manually.
That's just the warrior logo: [[File:Archive_team.png|42px]] (click on the image for a larger version). It means that that person is using the warrior. Those without the icon are running the project manually.
 
[[Image:Archiveteam-warrior-sticker.png|256px|right]]


=== What's that guy doing in the logo? ===
The place is on fire! But don't worry, he safely escaped with the rescued data in his arms.


=== I want to log in to the virtual machine. How do I do this? ===
=== That’s awesome—can I slap this logo on my laptop to show my Internet-preservation pride? ===


Unless you know what you are doing, you should not need to do this. But if you want to, the username is <code>root</code> and the password is <code>archiveteam</code>. Then, you can execute <code>sudo -u warrior -i</code> to log in as the warrior user.  
[http://www.redbubble.com/people/ajhajh/works/12857655-archive-team-warrior-stickers?p=sticker You sure can!] The ArchiveTeam Warrior laptop sticker can start conversations about archiving, if you’re into that.


Press ALT+F3 to switch to virtual console number 3. Use ALT+Left or ALT+Right to switch between virtual consoles. There are 6 virtual consoles in total. Number 1 and 2 are reserved for the warrior.
=== I'd like to help write code or I want to tweak the scripts to run to my liking. Where can I find more info? Where is the source code and repository? ===


=== The warrior seems to have too much overhead. I can't run a VM in a VPS! ===
In order to ensure data accuracy, it is imperative that users contributing to Archive Team projects '''do not modify the project scripts'''. If you would like to propose improvements to be included in future official versions of/updates to project scripts or would like to use our code for non-Archive Team projects, check out the [[Dev]] documentation for details on the infrastructure and details of the source code layout.


You don't need to run a virtual machine. If you are managing a VPS, it's likely you are comfortable with some Linux stuff. Projects can be run manually. Consult the project wiki page or the source code repository readme file.
=== I still have a question! ===
 
Check out the [[Frequently Asked Questions|general FAQ page]]. Talk to us on [[IRC]]. Use [ircs://irc.hackint.org:6697/warrior #warrior] for specific warrior questions or [ircs://irc.hackint.org:6697/archiveteam-bs #archiveteam-bs] for general questions.
 
== Troubleshooting ==
 
=== I'm getting errors when I try to launch the VM. ===
 
If you are receiving <code>Breakpoint has been reached (0x80000003)</code>, <code>A critical error has occurred while running the virtual machine and the machine execution has been stopped.</code>, or VT-X errors, you probably do not have virtualization enabled, either because it is turned off in your computer's BIOS or your CPU does not support it.
 
You can check CPU support on Linux with <code>grep -E "(vmx|svm)" /proc/cpuinfo | uniq</code>. (The <code>-E</code> option is needed so that <code>grep</code> treats <code>(vmx|svm)</code> as an alternation rather than as literal text.) If there is a line of output starting with "flags", your processor supports virtualization; if there is no output, it does not. You can check whether virtualization is enabled in the BIOS using the <code>rdmsr</code> utility in your distro's <code>msr-tools</code> package.
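
The CPU check can also be wrapped in a small script that gives a yes/no answer. This is only a sketch: the function name is hypothetical, and it looks for the <code>vmx</code> (Intel VT-x) or <code>svm</code> (AMD-V) flag on the "flags" lines of a cpuinfo-style file, defaulting to <code>/proc/cpuinfo</code>. It says nothing about whether virtualization is enabled in the BIOS.

```shell
#!/bin/sh
# Return success if the given cpuinfo file lists the vmx or svm CPU flag.
# Assumption: has_virt_flags is an illustrative helper, not an existing tool.
has_virt_flags() {
    cpuinfo=${1:-/proc/cpuinfo}
    grep -Eq '^flags[[:space:]]*:.*[[:space:]](vmx|svm)([[:space:]]|$)' "$cpuinfo" 2>/dev/null
}

if has_virt_flags; then
    echo "CPU advertises hardware virtualization (vmx or svm)."
else
    echo "No vmx/svm flag found (or /proc/cpuinfo is unavailable)."
fi
```

A different file can be passed as the first argument, for example a copy of <code>/proc/cpuinfo</code> taken from another machine.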
 
You can check support and BIOS status on Windows using [https://www.microsoft.com/en-us/download/details.aspx?id=592 Microsoft's Hardware-Assisted Virtualization Detection Tool] or [http://openlibsys.org/index-ja.html VirtualChecker].
 
To enable virtualization on a CPU with support, reboot the computer and enter the BIOS. The virtualization setting is usually under something like 'CPU configuration' or 'advanced settings'.
 
<!--=== I just imported the ova image and the warrior is stuck on "Preparing the data partition". ===
 
This issue has cropped up before, and we do not know what causes it. We recommend you delete the warrior image and import the ova again. Testing shows that such a reimport works in the majority of cases.
-->
=== I can't connect to localhost. ===
 
==== Virtual machine ====
 
The application is configured to set up port forwarding to the guest machine, and you should be able to access the interface through your web browser at port 8001. If this does not happen, and isn't resolved by rebooting the warrior (using the ACPI power signals, not suspend/save state and resume), you may need to double-check your machine's network settings (as described [[#How_can_I_run_multiple_virtual_machines_at_the_same_time.3F|above]]).
 
==== Docker container ====
 
Make sure you invoked <code>docker run</code> with the option <code>--publish</code>. To access the web interface at http://localhost:X/, you must use <code>--publish X:8001</code>.
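
If you are unsure which host port a container was published on, <code>docker port</code> lists the mappings. The sketch below turns one line of that output (for example <code>0.0.0.0:8123</code>) into the URL to open in a browser; the function name is an assumption made for this example, and it assumes the port is the part after the last colon.

```shell
#!/bin/sh
# Convert one line of `docker port <container> 8001` output,
# such as "0.0.0.0:8123" or "[::]:8123", into a localhost URL.
# Assumption: warrior_url_from_port_line is an illustrative helper name.
warrior_url_from_port_line() {
    line=$1
    # Keep only the text after the last colon, i.e. the host port.
    host_port=${line##*:}
    printf 'http://localhost:%s/\n' "$host_port"
}

# Usage (requires Docker and a running container named archiveteam-warrior):
#   warrior_url_from_port_line "$(docker port archiveteam-warrior 8001 | head -n 1)"
```

If <code>docker port</code> prints nothing for the container, it was most likely started without <code>--publish</code> and needs to be re-created with that option.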
 
=== The warrior can't connect to the internet. ===


=== Why a virtual machine in the first place? ===
This may manifest as the following error:


The virtual machine is a quick, safe, and easy way for newcomers to help us out. It offers many features:
<pre>
Checking Internet
wget: bad address 'warriorhq.archiveteam.org'
Unable to access the Internet
</pre>


* Graphical interface
It's possible that the virtual machine has picked up the address of the local DNS cache on your computer, which the virtual machine does not have access to.
* Automatically selects which project is important to run
 
* Self-updating software infrastructure
If you experience this on VirtualBox, see [http://askubuntu.com/questions/204953/virtualbox-dns-stopped-working-on-upgrade-to-12-10 this question and answer]. Additionally, check to see if "Cable Connected" is unchecked in the advanced settings of the virtual adapter, under the network tab in the virtual machine's settings. Check it if it's unchecked, then save your settings.
* Allows for unattended use
 
* In case of software faults, your machine is not ruined
Another option is to switch to "Host bridge" under the settings of the network adapter. If you do this, you won't be able to connect to 127.0.0.1; instead, use the first IP in the list below (without the /32, with :8001 at the end).
* Restarts itself in case of runaway programs
 
* Runs on Windows, Mac OS, Linux painlessly
=== I see a message that no item was received. ===
* Ensures consistency in the archived data regardless of your machine's quirks
 
This means that there is no work available. This can happen for several reasons:
 
* The project has just finished and someone is inspecting the work done. If a problem is discovered, items may be re-queued and more work will become available.
* You have checked out/claimed too many items. Reduce your concurrency and let others do some of the work too.
* In a rare case, you have been banned by a tracker administrator because there was a problem with your work: you were requesting too much, you were tampering with the scripts, a malfunction has occurred, or your internet connection is "unclean" (see [[#Can_I_use_whatever_internet_access_for_the_warrior.3F|above]]).
 
=== I see a message about rate limiting. ===
 
Don't worry. Keep in mind that although downloading the internet for fun and digital preservation are the primary goals of all Archive Team activities, serious stress on the target's server may occur. The rate limit is imposed by a [[Tracker#People|tracker administrator]] and should not be subverted.
 
(In other words, we don't want to DDoS the servers.)
 
If you like, you can switch to another [[Warrior projects|project]] with less load.
 
=== I see a message about code being out of date. ===
 
Don't worry. There is a new update ready. You do not need to do anything about this; the Warrior will update its code every hour. If you are impatient, please reboot the warrior and it will download the latest code and resume work.
 
=== I'm running a project manually and I see a message about code being out of date. ===
 
This happens when a bug in the scripts is discovered. Bugs are unavoidable, especially when the server is out of our control.
 
If you are running the scripts using Docker, we recommend using Watchtower to check for updates every hour, downloading and installing them when necessary. See the setup instructions in [[Running_Archive_Team_Projects_with_Docker|Running Archive Team Projects with Docker]] for more details.
 
If you are not running the scripts using the provided Docker images, try the <code>--auto-update</code> option available in Seesaw version 0.8. However, please be aware that you are now executing code automatically. Be sure to run the scripts in a separate user account for safety.
 
=== I see messages about rsync errors. ===
 
Uh-oh! Something is not right. Please notify us immediately in the appropriate [[IRC]] channel.
 
=== I told the warrior to shut down from the interface, but nothing has changed. ===
 
The warrior will attempt to finish the current running tasks before shutting down. If you need to shut down right away, go ahead. Your progress will be lost, but the jobs will eventually cycle out to another user.
 
=== The warrior is eating all my bandwidth! ===
 
==== Virtual machine ====
 
On VirtualBox (relatively recent versions), use this command:
 
<pre>VBoxManage bandwidthctl archiveteam-warrior-3.2 add limit --type network --limit 3m</pre>
 
This will limit the warrior to 3Mb/s. (Limit units are <code>k</code> for kilobit, <code>m</code> for megabit, <code>g</code> for gigabit, <code>K</code> for kilobyte, <code>M</code> for megabyte, and <code>G</code> for gigabyte.)  Adjust as required. :)
 
In the latest version of VirtualBox on Windows, the syntax appears to have changed. The correct command now seems to be:
 
<pre>VBoxManage bandwidthctl archiveteam-warrior-3.2 add netlimit --type network --limit 3</pre>
 
For more information, consult [http://www.virtualbox.org/manual/ch06.html#network_bandwidth_limit the VirtualBox manual (Chapter 6, Section 9)].
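
As far as we understand the VirtualBox manual, the word after <code>add</code> is the name of a bandwidth group, and the group only takes effect once it is attached to a network adapter. The sketch below prints the two commands rather than running them. The helper name <code>bandwidth_limit_cmds</code> and the group name <code>warrior-limit</code> are made up for illustration, and attaching to adapter 1 is an assumption. <code>modifyvm</code> expects the VM to be powered off.

```shell
#!/bin/sh
# Print (without running) the VBoxManage commands that create a bandwidth
# group and attach it to the VM's first network adapter.
# bandwidth_limit_cmds and the group name are hypothetical.
bandwidth_limit_cmds() {
    vm="$1"
    limit="$2"               # e.g. 3m for 3 megabits per second
    group="warrior-limit"    # hypothetical bandwidth group name
    echo "VBoxManage bandwidthctl $vm add $group --type network --limit $limit"
    echo "VBoxManage modifyvm $vm --nicbandwidthgroup1 $group"
}

bandwidth_limit_cmds archiveteam-warrior-3.2 3m
```

Review the printed commands against your VirtualBox version before running them.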
 
On VMware (versions 9 and above), select a virtual machine and open <code>Settings > Hardware > Virtual Network Adapter > Advanced</code>. You can set a bandwidth limit here.
 
==== Docker container ====
 
You're out of luck; Docker has no feature for limiting bandwidth.
 
=== The Warrior virtual machine is using up disk space, even though it's not running a project! ===


If you have suggestions for improving this system, please talk to us as described below.
Virtual machine disk images do not behave like a regular file. There are several ways to safely reclaim space:


=== Can I use Docker instead? ===
* Delete the entire warrior application and re-import it.
* Use the VirtualBox CLI to compact the disk. First, shut down the VM. Then, open a terminal and navigate to the folder where the hard-disk VDI file is stored. Finally, run <code>VBoxManage modifymedium --compact archiveteam-warrior-v3.2-20210306-disk001.vdi</code>, replacing <code>archiveteam-warrior-v3.2-20210306-disk001.vdi</code> with the name of the VDI file in use by the Warrior VM. See the [https://www.virtualbox.org/manual/ch08.html#vboxmanage-modifymedium VirtualBox documentation] for more details and additional steps to help achieve a better result.
* Use the [http://intgat.tigress.co.uk/rmy/uml/index.html zerofree] program and then clone the disk image. Reattach the cloned disk image.


[https://github.com/ArchiveTeam/warrior-dockerfile Yes!]. Thanks to the initiative of an individual, the warrior infrastructure can run in a Docker instance.
This issue should not affect Docker containers.


=== What about Windows Hyper-V, can I use that instead? ===
=== My Warrior crashed, or I had to hard-stop it, and I don't think there's time to retry the tasks. ===


[http://jonimoose.net/2013/archiveteam-warrior-on-hyper-v/ Yes!]. Thanks to the initiative of an individual, a modified Warrior VM is all set to be imported into Windows Hyper-V Servers.  
Please notify us in the project [[Archiveteam:IRC|IRC]] channel, including for stops due to system failures and power outages as well as hard-kills. Do not attempt to start or restart the affected containers. If time is indeed a concern, we can help you save or recover partial data from your Warrior.


=== I just imported the ova image and the warrior is stuck on "Preparing the data partition" ===
(The same applies if your Warrior is still running, but tasks are now stuck—especially if they're stuck because the project has reached its deadline and the target site is now gone. Most projects should handle this gracefully, but contact us if they do not.)


This issue has cropped up before and we do not know what causes it. It is recommended to just delete the warrior image and import the ova again. Testing shows the import works the majority of the time.
=== The item I'm working on is downloading thousands of URLs and it's taking hours. ===
 
Please notify us in the appropriate [[IRC]] channel. You may need to reboot the Warrior.


=== Why is the default project not working? / Why is a manual project not in the Warrior yet? ===
Sorry. Sometimes the administrators are too busy...


=== Where can I file a bug or a feature request? ===
=== Why are there no projects? ===
 
We finished the ones we were working on! If there are no projects showing, you can [[Dev|help us write one]]. No projects does ''not'' mean there is nothing left to archive!


If the issue is related to the warrior's web interface or the library that grab scripts are using, see [https://github.com/ArchiveTeam/seesaw-kit/issues seesaw-kit issues]. Other issues should be filed into their own [[Dev/Source_Code|repositories]].
=== The instructions to run the software/scripts are awful and they are difficult to set up. ===


=== I still have a question! ===
Well, excuuuuse me, princess!


Talk to us on [[IRC]]. Use [irc://irc.efnet.org/warrior #warrior] for specific warrior questions or [irc://irc.efnet.org/archiveteam #archiveteam] for general questions.
We're not a professional support team so help us help you help us all. See above for [[#Where_can_I_file_a_bug.2C_suggestion.2C_or_a_feature_request.3F|bug reports]], [[#Where_can_I_file_a_bug.2C_suggestion.2C_or_a_feature_request.3F|suggestions]], or [[#I.27d_like_to_help_write_code._Where_can_I_find_more_info.3F|code contributions]].


== Projects ==
=== Where can I file a bug, suggestion, or a feature request? ===


Previous and current warrior projects:
If the issue is related to the warrior's web interface or the library that grab scripts are using, see [https://github.com/ArchiveTeam/seesaw-kit/issues seesaw-kit issues]. Other issues should be filed into their own [[Dev/Source_Code|repositories]].


{| class="wikitable"
== Projects ==
! Project !! Status !! Began !! Finished !! Result !! Archive Location
|-
| [[MobileMe]] || '''Archive Posted''' || April 3, 2012 || Aug 8, 2012 || Success ||
[http://archive.org/details/archiveteam-mobileme-hero archive] [http://archive.org/details/archiveteam-mobileme-index index] [http://archive.org/download/archiveteam-mobileme-index/mobileme-20120817.html user lookup]
|-
| [[FortuneCity]] || '''Archive Posted''' || April 4, 2012 || April 11, 2012 || Partial Success || [http://archive.org/details/archiveteam-fortunecity archive] [http://archive.org/download/test-memac-index-test/fortunecity.html user lookup]
|-
| [[Tabblo]] || '''Archive Posted''' || May 23, 2012 || May 26, 2012 || Success || [http://archive.org/details/tabblo-archive archive] [http://archive.org/download/test-memac-index-test/tabblo.html user lookup]
|-
| [[Picplz]] || '''Archive Posted''' || June 3, 2012 || June 15, 2012 || || [http://archive.org/details/archiveteam-picplz archive] [http://archive.org/details/archiveteam-picplz-index index] [http://archive.org/download/archiveteam-picplz-index/picplz-20120823.html user lookup]
|-
| [[Tumblr]] (test project) || '''Archive Posted''' || August 9, 2012 || August 19, 2012 || || [http://archive.org/details/archiveteam-tumblr-test archive (tar)] [http://archive.org/details/archiveteam-tumblr-test-warc archive (warc)]
|-
| [[Cinch]].FM || '''Archive Posted''' || August 20, 2012 || August 22, 2012 || Success || [http://archive.org/details/archiveteam-cinch archive]
|-
| [[City of Heroes]] || '''Archive Posted''' || September 3, 2012 || December 1, 2012 || Success || [http://archive.org/details/archiveteam-city-of-heroes-www www] [http://archive.org/details/archiveteam-city-of-heroes-main forums] [http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-1 1] [http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-2 2] [http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-3 3] [http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-4 4] [http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5 5]
|-
| [[Webshots]] || '''Archive Posted''' || October 4, 2012 || November 18, 2012 || || [http://archive.org/download/webshots-freeze-frame-index/index.html index]
|-
| [[BT Internet]] || '''Archive Posted''' || October 10, 2012 || November 2, 2012 || Success || [http://archive.org/details/archiveteam-btinternet archive]
|-
| [[DailyBooth| Daily Booth]] || '''Archive Posted''' || November 19, 2012 || December 29, 2012 || || [http://archive.org/details/archiveteam_dailybooth archive] [http://archive.org/download/dailybooth-freeze-frame-index/index.html lookup]
|-
| [[GitHub Downloads]] || '''Archive Posted''' || December 13, 2012 || December 17, 2012 || Success || [http://archive.org/details/github-downloads-2012-12 archive] [http://archive.org/details/archiveteam-github-repository-index-201212 index]
|-
| [[Yahoo! Blog]] || '''Archive Posted''' || January 8, 2013 || January 19, 2013 || || [http://archive.org/details/yahoo_korea_blogs archive]
|-
| [[weblog.nl]] || '''Archive Posted''' || January 19, 2013 || February 2, 2013 || || [http://archive.org/details/archiveteam_weblognl archive] [http://archive.org/download/archiveteam_weblognl-index/ lookup]
|-
| [[URLTeam]] || Active || || || || [http://urlte.am/releases/ all releases]
|-
| [[Punchfork]] || '''Archive Posted''' || January 11, 2013 || March 6, 2013 || || [http://archive.org/details/archiveteam_punchfork archive] [http://archive.org/download/archiveteam_punchfork_index/ user lookup]
|-
| [[Xanga]] || Downloads Paused || January 22, 2013 || February 16, 2013 || || [http://archive.org/details/archiveteam_xanga archive] [http://archive.org/download/archiveteam_xanga_index/ user lookup] [http://archive.org/details/archiveteam-xanga-userlist-20130142 user list]
|-
| [[Posterous]] || '''Archive Posted''' || February 23, 2013 || June 29, 2013 || || [http://archive.org/details/archiveteam_posterous archive]
|-
| [[Storylane]] || Downloads Finished || March 8, 2013 || March 15, 2013 || || {{update_me}}
|-
| [[Yahoo! Messages]] || '''Archive Posted''' || March 20, 2013 || March 31, 2013 || || [http://archive.org/details/archiveteam_yahoo_messages archive]
|-
| [[Formspring]] || '''Archive Posted''' || March 24, 2013 || September 19, 2013 || Success || [http://archive.org/details/archiveteam_formspring archive]
|-
| [[Yahoo Upcoming]] || '''Archive Posted''' || April 20, 2013 || April 25, 2013 || || [http://archive.org/details/archiveteam archive]
|-
| [[Streetfiles]].org || '''Archive Posted''' || April 28, 2013 || April 30, 2013 || Partial || [http://archive.org/details/archiveteam archive]
|-
| [[Xanga]] || Downloads Paused || June 21, 2013 || August 31, 2013 || || [http://archive.org/details/archiveteam_xanga archive]
|-
| [[Zapd]] || '''Archive Posted''' || October 1, 2013 || October 8, 2013 || Success || [https://archive.org/details/archiveteam_zapd archive]
|-
| [[Blip.tv]] || Hiatus || October 11, 2013 || ||  ||
|-
| [[Hyves]] || '''Archives Posted''' || November 10, 2013 || December 2, 2013 || Success ||  [http://archive.org/details/hyves archive]
|-
| [[Wretch]] & [[Yahoo! Blog]] || Active || December 17, 2013 ||  ||  || 
|}


=== Status ===
See [[Warrior projects]].
:; In Development : a future project
:; Active : start up a Warrior and join the fun; this one is in progress right now
:; Downloads Finished : we've finished downloading the data
:; Archived : the collected data has been properly archived
:; Archive Posted : the archive is available for download


=== Result ===
== Are you a coder? ==
:; Success : downloaded all of the data and posted the archive publicly
:; Qualified Success :  either we couldn't get all of the data, or the archive can't be made public
:; Failure : the site closed before we could download anything


=== Are you a coder? ===
Like the Warrior? Interested in how it works under the hood? Got software skills? '''[[Dev|Help us improve it!]]'''


Like the warrior? Interested in how it works under the hood? Got software skills? '''[[Dev|Help us improve it!]]'''
{{Navigation box}}

Revision as of 07:33, 26 April 2022

The current versions of the Warrior Docker image and the Warrior virtual machine image should now be compatible with most projects; however some projects may still not be compatible and show a blank screen when attempting to run them. As an alternative, you can run individual projects manually using Docker.

If you have any issues or feedback, see the AT #warrior IRC channel on hackint.

What is the Archive Team Warrior?


The Archive Team Warrior is a virtual archiving appliance. You can run it to help with the Archive Team archiving efforts. It will download sites and upload them to our archive—and it’s really easy to do!

The warrior is a container running inside a virtual machine, so there is almost no security risk to your computer. "Almost", because in practice nothing is 100% secure. The warrior will only use your bandwidth and some of your disk space, as well as some of your CPU and memory. It will get tasks from and report progress to the Tracker.

Basic usage

The Warrior runs on Windows, macOS, and Linux. You can run it using a virtual machine (simplest) or using Docker (slightly more complicated, but much less overhead than the VM).

Installing and running with a virtual machine

You'll need:

VirtualBox

  1. Download the appliance from the link above.
  2. Launch VirtualBox.
  3. In VirtualBox, click File > Import Appliance and open the file.
  4. Start the virtual machine.
    • It will fetch the latest updates and will eventually tell you to start your web browser.
  5. Using your regular web browser, visit http://localhost:8001/.

A video demonstrating these steps is available. (Note that the screen indicating that the Warrior has finished loading looks different than the one from when this video was made, but the steps are otherwise the same.)

VMware Player

Note that VMware Player may have some compatibility issues with running the Warrior image.

  1. Download the appliance from the link above.
  2. Launch VMware Player.
  3. In Player on the right, click "Open Virtual Machine", open the file and import the virtual machine.
  4. (Optional) Select the virtual machine and click "Edit virtual machine settings".
    • Select Network Adapter and set it to "Bridged: Connected directly to the physical network"
  5. Start the virtual machine.
    • It will fetch the latest updates and will eventually tell you to start your web browser.
  6. Using your regular web browser, visit the address that is shown on the bottom (e.g. http://192.168.0.100:8001/)

Installing and running with Docker

You'll need Docker (open source) and the Warrior Docker image.

  1. Download Docker from the link above and install it.
  2. Open your terminal. On Windows, you can use either Command Prompt (CMD) or PowerShell. On macOS and Linux you can use Terminal (Bash).
  3. Use the following command to start the Warrior as well as Watchtower, which will automatically keep your Warrior updated:
    docker run --detach --name watchtower --restart=on-failure --volume /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --label-enable --cleanup --interval 3600 && docker run --detach --name archiveteam-warrior --label=com.centurylinklabs.watchtower.enable=true --restart=on-failure --publish 8001:8001 atdr.meo.ws/archiveteam/warrior-dockerfile
    Note that the current version of this command may not persist Warrior configuration (username, selected project, and item concurrency) across container/dependency updates. These types of updates typically only occur once every few months and are far less frequent than normal script updates, which happen inside the container without affecting the container configuration. (For a full explanation of this command, see items 3 and 4 here.)
    You may wish to protect the web configuration interface for your Warrior by setting a username and password for the web interface and by adding a rule to your firewall (such as ufw).
  4. Using your regular web browser, visit http://localhost:8001/.

If you prefer Podman over Docker, User:Sanqui has had success running the Warrior Docker image under Podman using podman run --detach --name at-warrior --label=io.containers.autoupdate --restart=on-failure --publish 8001:8001 atdr.meo.ws/archiveteam/warrior-dockerfile and podman-auto-update in place of Watchtower.

Warrior FAQ

Why a virtual machine/container in the first place?

The Warrior is a quick, safe, and easy way for newcomers to help us out. It offers many features:

  • Graphical interface (virtual machine only)
  • Automatically selects which project is important to run
  • Self-updating software infrastructure
  • Allows for unattended use
  • In case of software faults, your machine is not ruined
  • Restarts itself in case of runaway programs
  • Runs on Windows, Mac, and Linux painlessly
  • Ensures consistency in the archived data regardless of your machine's quirks
  • Can be configured to restart automatically after a system restart (see below).

If you have suggestions for improving this system, talk to us.

Can I use whatever internet access for the Warrior?

No. We need "clean" connections. Please ensure the following:

  • No OpenDNS. No ISP DNS that redirects to a search page. Use non-captive DNS servers.
  • No ISP connections that inject advertisements into web pages.
  • No proxies. Proxies can return bad data. The original HTTP headers and IP address are needed for the WARC file.
  • No content-filtering firewalls.
  • No censorship. If you believe your country implements censorship, do not run a warrior.
  • No Tor. The server may return an error page instead of content if they ban exit nodes.
  • No free cafe wifi. Archiving your cafe's wifi service agreement repeatedly is not helpful.
  • No VPNs. Data integrity is a very high priority for the Archive Team so use of VPNs with the official crawler is discouraged.
  • We prefer connections from many public IP addresses if possible. (For example, if your apartment building uses a single IP address, we don't want your apartment banned.)

I turned my Warrior off. Will those tasks be lost?

If you've killed your Warrior, then the work it was doing has been lost. However, the tasks will be returned to the pool after a period of time, and other warriors may claim them.

I closed my browser or tab with the Warrior's web interface. Will those tasks be lost?

No. The web browser interface just provides a user interface to the Warrior. As long as the VM or Docker container is not stopped, it will continue normally.

How can I shut down the Warrior without losing work?

Recommended method

Click the "Shut down" button on the left of the web interface. All the current tasks will still finish, but no new ones will be started. When a banner appears saying "There is no connection with the warrior", the Warrior has finished shutting down. (If you would rather use the command line than the web interface, see below.)

Suspend/resume with the virtual machine

If you don't want to wait (perhaps because a task is long-running), you can use VirtualBox's Machine > Pause or VMware's VM > Pause to suspend the Warrior VM, then resume it when you are ready to work again. Note that if you keep it suspended for too long (more than a few hours), the tracker will assume that the item is lost and re-queue it—but suspending in order to reboot your computer or reset your internet connection should be perfectly fine.

Suspend/resume with the Docker container

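If you don't want to wait, Docker has no equivalent of a virtual machine's saved state, but docker pause archiveteam-warrior and docker unpause archiveteam-warrior freeze and resume the container's processes. A paused container does not survive a reboot of the host, and the tracker will still re-queue items if the container stays paused for too long.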

How much disk space will the Warrior use?

Short answer: it depends on the project. The virtual machine has a hard limit of 60GB disk usage, but the Docker container does not have such a limit. However, it is highly unlikely that any project would use more than 60GB of disk space at any time.

Long answer: because each project defines items differently, sizes may vary. A single task may be a small file or a whole subsection of a website. The virtual machine is configured by default to use an absolute maximum of 60GB, but Docker has no hard limit. Disk space that the virtual machine or Docker container has not yet used is not consumed on the host computer. You may configure the virtual machine to run on less than 60GB if you like to live dangerously. We're downloading the internet, after all!

How can I log into the Warrior?

Unless you know what you are doing, you should not need to do this.

Virtual machine

With the Warrior running, press ALT+F4 to switch to virtual console number 4. VirtualBox users may need to press the host key, RIGHT_CONTROL, to enter capture mode before pressing ALT+F4. Use ALT+Left or ALT+Right to switch between virtual consoles. There are 6 virtual consoles in total. Consoles 1, 2, and 3 are reserved for the warrior. Switching to a new virtual console will show a login shell. You can log in using the username root and the password archiveteam.

Docker container

With the Warrior running, open your terminal and run sudo docker exec -t -i archiveteam-warrior /bin/bash. Replace 'archiveteam-warrior' with the name of your Warrior container if necessary.

How can I run multiple Warriors at the same time?

This usually isn't necessary; if you want to increase your work on a project, you can increase the number of items your Warrior will work on at the same time. In the web interface, go to the "Your settings" tab, tick the "Show advanced settings" box, and edit the "Concurrent items" field. The maximum concurrency is 6.

Virtual machines

You'll need to adjust the networking settings.

In VirtualBox, select a virtual machine and open up Settings > Network > Adapter 1 > Port Forwarding. You need to adjust the host port. For example, setting your table to TCP | 127.0.0.1 | 8123 | | 8001 will map port 8123 on the host machine (your computer) to port 8001 on the virtual machine (the warrior), and you can then access the warrior's web interface from port 8123 in your browser.
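
The same port-forwarding rule can be added from the command line. The sketch below only prints the command so you can review it first. The helper name natpf_cmd and the rule name warrior-web-<port> are made up for illustration. It assumes adapter 1 is in NAT mode and that the VM is powered off, which modifyvm expects.

```shell
#!/bin/sh
# Print (without running) the VBoxManage command that forwards a host port
# on 127.0.0.1 to port 8001 on the Warrior VM.
# natpf_cmd and the rule name are hypothetical.
natpf_cmd() {
    vm="$1"
    host_port="$2"
    # Rule format: name,protocol,host IP,host port,guest IP,guest port
    echo "VBoxManage modifyvm $vm --natpf1 warrior-web-$host_port,tcp,127.0.0.1,$host_port,,8001"
}

natpf_cmd archiveteam-warrior-3.2 8123
```

With this rule in place, the web interface of that VM would be reachable at http://localhost:8123/.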

VMware installations should be using bridged networking. However, if you want, you can switch to NAT (under Settings > Hardware > Virtual Network Adapter) and click Edit to set up port forwarding. On Linux, you can also use lines like 8123 = 192.168.0.100:8001 in the [incomingtcp] section of nat.conf. (Make sure the VM IP is correct!)

Each VM you want to access should have a different host port. Do not use port numbers below 1024 unless you know what you are doing.

Docker containers

You'll need to adjust the run command used to create the containers.

First, each container needs a unique name, so you will need to replace the name specified with the --name parameter with something unique.

Second, you will need to specify a unique port to access the web interface of each container. You can do this by changing the number before the : in the --publish parameter to any available unique port number equal to or greater than 1024. (Additional options for specifying ports are explained in the Docker documentation.)

You may also want to reuse your configuration between different Docker containers; you can do this by specifying the same environment variables or bindmounting the same config.json file across all of your containers. See the Warrior Dockerfile README for more details about this.
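
The naming and port rules above can be sketched as a small loop. It prints one docker run command per container instead of running anything. The helper name warrior_fleet_cmds and the archiveteam-warrior-<n> naming scheme are made up for illustration; the flags and image name are copied from the installation command on this page.

```shell
#!/bin/sh
# Print (without running) one docker run command per Warrior, each with a
# unique container name and a unique host port mapped to port 8001.
# warrior_fleet_cmds and the naming scheme are hypothetical.
warrior_fleet_cmds() {
    count="$1"
    first_port="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        port=$((first_port + i))
        echo "docker run --detach --name archiveteam-warrior-$i --label=com.centurylinklabs.watchtower.enable=true --restart=on-failure --publish $port:8001 atdr.meo.ws/archiveteam/warrior-dockerfile"
        i=$((i + 1))
    done
}

warrior_fleet_cmds 3 8001
```

Once the printed commands look right, they can be run one by one or piped to sh.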

How can I run the Warrior headlessly (without leaving a window open)?

Virtual machine

From the VirtualBox GUI, after opening the VM, click Machine > Detach GUI. You can then close the VirtualBox Manager window.

For the VirtualBox CLI, you can start up the VM with VBoxManage startvm archiveteam-warrior-3.2 --type headless and shut it down with VBoxManage controlvm archiveteam-warrior-3.2 acpipowerbutton. Substituting suspend or resume for acpipowerbutton suspends or resumes the VM. For more information, consult the VirtualBox manual (Chapter 8, Sections 12 and 13).

For the VMware CLI, you can start up the VM with vmrun start <path to vmx file> nogui and shut it down with vmrun stop <path to vmx file> soft. Substituting suspend for stop suspends the VM; resume with start again. For more information, including the paths to VMX files on different operating systems, consult Using vmrun to Control Virtual Machines (PDF), pages 10 and 11.

Docker container

The container does not have a GUI, and if run with --detach (as the instructions suggest), it will not occupy your terminal window either. It is therefore headless by default. You can start up the container with docker start archiveteam-warrior and shut it down with docker kill --signal=SIGINT archiveteam-warrior.

How can I set up the Warrior to start up on boot and shut down automatically?

Virtual machine

If you are using VirtualBox and running a Linux distribution that uses the systemd init system (like most recent releases), you can set the VM up as a system service by following the short instructions on this page. (The page title specifies Arch Linux, but this will work for other distros as long as they run systemd.)

Docker container

If the container is run with --restart=on-failure (as the instructions suggest), Docker will automatically start it on boot.

How can I set up the virtual machine with directly-bridged networking instead of NAT?

On VirtualBox, use these commands:

VBoxManage modifyvm archiveteam-warrior-3.2 --nic1 bridged
VBoxManage modifyvm archiveteam-warrior-3.2 --bridgeadapter1 eth0

We presume you want to bind to eth0. Adjust as required. :)

VMware installations should already be using bridged networking.

How can I access the virtual machine from another device on my network?

A full guide for VirtualBox users is available here.

What's new in version 3.2 of the Warrior virtual machine?

This update enables running newer projects, shortens startup times, enables viewing basic logs from the virtual machine console (press ALT+F2 for Warrior logs, press ALT+F3 for automatic updater logs, and press ALT+F1 to return to the splash screen), and has other minor improvements. Warrior versions 3.0 and 3.1 will automatically update themselves with the project compatibility improvements, but the other improvements require re-creating the VM with version 3.2 of the appliance.

Are previous versions of the Warrior still supported?

Virtual machine

Currently, versions 3.0, 3.1, and 3.2 of the Warrior virtual machine are functional and supported, and are capable of automatically retrieving updated components as needed. Support for version 2 and prior of the Warrior virtual machine was discontinued around 2018 due to outdated SSL support.

Docker container

We always recommend using the latest version of the Warrior Docker image, as new and updated projects often require the updated components provided by newer Docker images. If you run the Docker container with Watchtower (as the instructions suggest), your Docker container will automatically be kept up-to-date.

How can I run tons of Warriors easily?

We assume you've checked with the current Archive Team project leads what concurrency and resources are needed or useful!

Whether you have your own virtual cluster or you're renting someone else's (aka a "cloud"), you probably need some orchestration software.

Archive Team volunteers have successfully used a variety of hosting providers and tools (including free trials on AWS and GCE), often just by building their own flavor of virtual server and then repeating it with simple cloud-init scripts or whatever tool the hosting provides. If you desire full automation, the archiveteam-infra repository by diggan helps with Terraform on DigitalOcean.

Some custom monitoring scripts also exist, for instance watcher.py.

The instructions for running multiple Warriors on one machine may be helpful. However, you should also consider running Docker containers for individual projects rather than the Warrior; these have even less overhead and can be configured with greater concurrency.

What are the alternatives to using the Warrior?

One is running Docker containers for individual projects. This is particularly useful if you want to deploy a large amount of computing power.

Another is running individual projects directly. Check the source repository for the project you're interested in and follow the instructions for running without a Warrior in the README. This is particularly useful if you want to deploy on a machine where you don't have root.

We generally recommend that people use the Warrior, because it is simple for non-technical users to set up and it requires no supervision. You should only use these alternatives if you are comfortable with Linux and prepared to manually intervene when projects begin and end.

I'm looking at the leaderboard. What's that icon beside the username?

That's just the Warrior logo. It means that the person is using the Warrior. Those without the icon are running the project manually.

What's that guy doing in the logo?

The place is on fire! But don't worry, he safely escaped with the rescued data in his arms.

That’s awesome—can I slap this logo on my laptop to show my Internet-preservation pride?

You sure can! The ArchiveTeam Warrior laptop sticker can start conversations about archiving, if you’re into that.

I'd like to help write code or I want to tweak the scripts to run to my liking. Where can I find more info? Where is the source code and repository?

In order to ensure data accuracy, it is imperative that users contributing to Archive Team projects do not modify the project scripts. If you would like to propose improvements for future official versions of the project scripts, or would like to use our code for non-Archive Team projects, check out the Dev documentation for details on the infrastructure and the source code layout.

I still have a question!

Check out the general FAQ page. Talk to us on IRC. Use #warrior for specific warrior questions or #archiveteam-bs for general questions.

Troubleshooting

I'm getting errors when I try to launch the VM.

If you are receiving "Breakpoint has been reached (0x80000003)", "A critical error has occurred while running the virtual machine and the machine execution has been stopped.", or VT-x errors, you probably do not have virtualization enabled, either because it is turned off in your computer's BIOS or because your CPU does not support it.

You can check CPU support on Linux with cat /proc/cpuinfo | grep -E "(vmx|svm)" | uniq (the -E option is required so that grep treats the parentheses and the | as alternation rather than literal characters). If there is a line of output starting with "flags", your processor supports virtualization; if there is no output, it does not. You can check whether virtualization is enabled in the BIOS using the rdmsr utility in your distro's msr-tools package.
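The CPU flag check can also be wrapped in a small script. This is only a minimal sketch: the function name has_virt_flag and its optional file argument are made up for illustration, and it reports only whether the CPU advertises the vmx (Intel) or svm (AMD) flag, not whether virtualization is enabled in the BIOS.

```shell
#!/bin/sh
# Return success if a cpuinfo-style file lists the vmx or svm CPU flag.
# has_virt_flag and its optional path argument are illustrative names.
has_virt_flag() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Bail out with a distinct status if the file cannot be read (e.g. not Linux).
    [ -r "$cpuinfo" ] || return 2
    # -E: extended regex so | means "or"; -w: whole words only; -q: no output.
    grep -Eqw 'vmx|svm' "$cpuinfo"
}

if has_virt_flag; then
    echo "CPU advertises vmx or svm"
else
    echo "no vmx/svm flag found (or /proc/cpuinfo is not readable)"
fi
```

A positive result here still needs the BIOS setting to be turned on before the VM will start.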

You can check support and BIOS status on Windows using Microsoft's Hardware-Assisted Virtualization Detection Tool or VirtualChecker.

To enable virtualization on a CPU with support, reboot the computer and enter the BIOS. The virtualization setting is usually under something like 'CPU configuration' or 'advanced settings'.

I can't connect to localhost.

Virtual machine

The application is configured to set up port forwarding to the guest machine, and you should be able to access the interface through your web browser at port 8001. If this does not happen, and isn't resolved by rebooting the warrior (using the ACPI power signals, not suspend/save state and resume), you may need to double-check your machine's network settings (as described above).

Docker container

Make sure you invoked docker run with the option --publish. To access the web interface at http://localhost:X/, you must use --publish X:8001.
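To make the port mapping concrete, here is a hedged sketch that only prints a docker run command instead of executing it. The host port 8080, the helper name publish_arg, and the WARRIOR_IMAGE placeholder are all assumptions for illustration; substitute the image name and any other options from the installation instructions.

```shell
#!/bin/sh
# Build the --publish argument mapping a host port to the Warrior's
# web interface port (8001) inside the container.
# publish_arg is an illustrative helper, not part of Docker or the Warrior.
publish_arg() {
    host_port="$1"
    echo "--publish ${host_port}:8001"
}

# WARRIOR_IMAGE is a placeholder; use the image name from the setup instructions.
WARRIOR_IMAGE="${WARRIOR_IMAGE:-<warrior-image>}"

# Print (do not run) the resulting command for host port 8080.
echo "docker run --detach $(publish_arg 8080) $WARRIOR_IMAGE"
```

With this mapping, the web interface would be reachable at http://localhost:8080/.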

The warrior can't connect to the internet.

This may manifest as the following error:

Checking Internet wget: bad address 'warriorhq.archiveteam.org' Unable to access the Internet

It's possible that the virtual machine has picked up the address of the local DNS cache on your computer, which the virtual machine does not have access to.

If you experience this on VirtualBox, see this question and answer. Additionally, check to see if "Cable Connected" is unchecked in the advanced settings of the virtual adapter, under the network tab in the virtual machine's settings. Check it if it's unchecked, then save your settings.

Another option is to switch to "Host bridge" under the settings of the network adapter. If you do this, you won't be able to connect to 127.0.0.1; instead, use the first IP in the list below (without the /32, with :8001 at the end).

I see a message that no item was received.

This means that there is no work available. This can happen for several reasons:

  • The project has just finished and someone is inspecting the work done. If a problem is discovered, items may be re-queued and more work will become available.
  • You have checked out/claimed too many items. Reduce your concurrency and let others do some of the work too.
  • In a rare case, you have been banned by a tracker administrator because there was a problem with your work: you were requesting too much, you were tampering with the scripts, a malfunction has occurred, or your internet connection is "unclean" (see above).

I see a message about rate limiting.

Don't worry. Keep in mind that although downloading the internet for fun and digital preservation are the primary goals of all Archive Team activities, serious stress on the target's server may occur. The rate limit is imposed by a tracker administrator and should not be subverted.

(In other words, we don't want to DDoS the servers.)

If you like, you can switch to another project with less load.

I see a message about code being out of date.

Don't worry. There is a new update ready. You do not need to do anything about this; the Warrior will update its code every hour. If you are impatient, please reboot the warrior and it will download the latest code and resume work.

I'm running a project manually and I see a message about code being out of date.

This happens when a bug in the scripts is discovered. Bugs are unavoidable, especially when the server is out of our control.

If you are running the scripts using Docker, we recommend using Watchtower to check for updates every hour, downloading and installing them when necessary. See the setup instructions in Running Archive Team Projects with Docker for more details.

If you are not running the scripts using the provided Docker images, try the --auto-update option available in Seesaw version 0.8. However, please be aware that you are now executing code automatically. Be sure to run the scripts in a separate user account for safety.

I see messages about rsync errors.

Uh-oh! Something is not right. Please notify us immediately in the appropriate IRC channel.

I told the warrior to shut down from the interface, but nothing has changed.

The warrior will attempt to finish the currently running tasks before shutting down. If you need to shut down right away, go ahead. Your progress will be lost, but the jobs will eventually cycle out to another user.

The warrior is eating all my bandwidth!

Virtual machine

On VirtualBox (relatively recent versions), use this command:

VBoxManage bandwidthctl archiveteam-warrior-3.2 add limit --type network --limit 3m

This will limit the warrior to 3Mb/s. (Limit units are k for kilobit, m for megabit, g for gigabit, K for kilobyte, M for megabyte, and G for gigabyte.) Adjust as required. :)

In the latest version of VirtualBox on Windows, the syntax appears to have changed. The correct command now seems to be:

VBoxManage bandwidthctl archiveteam-warrior-3.2 add netlimit --type network --limit 3

For more information, consult the VirtualBox manual (Chapter 6, Section 9).
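Because the lowercase suffixes are bits and the uppercase ones are bytes, it is easy to set a limit eight times higher or lower than intended. The following sketch converts a limit value to kilobytes per second for comparison. It assumes decimal prefixes (1 m = 1000 k) and 8 bits per byte; VirtualBox's exact rounding may differ, and the function name limit_to_kbytes is purely illustrative.

```shell
#!/bin/sh
# Convert a bandwidth limit with a unit suffix (e.g. 3m, 500K) to kilobytes/s.
# Assumption: decimal prefixes; integer division rounds down.
limit_to_kbytes() {
    value="${1%?}"          # numeric part, e.g. "3" from "3m"
    unit="${1#"$value"}"    # unit suffix, e.g. "m"
    case "$unit" in
        k) echo $(( value / 8 )) ;;             # kilobit
        m) echo $(( value * 1000 / 8 )) ;;      # megabit
        g) echo $(( value * 1000000 / 8 )) ;;   # gigabit
        K) echo "$value" ;;                     # kilobyte
        M) echo $(( value * 1000 )) ;;          # megabyte
        G) echo $(( value * 1000000 )) ;;       # gigabyte
        *) echo "unknown unit: $unit" >&2; return 1 ;;
    esac
}

limit_to_kbytes 3m   # prints 375
```

So a limit of 3m corresponds to roughly 375 kilobytes per second under these assumptions.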

On VMware (versions 9 and above), select a virtual machine and open Settings > Hardware > Virtual Network Adapter > Advanced. You can set a bandwidth limit here.

Docker container

You're out of luck; Docker has no feature for limiting bandwidth.

The Warrior virtual machine is using up disk space, even though it's not running a project!

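Virtual machine disk images do not behave like regular files: a dynamically allocated disk image grows as data is written, but it does not shrink automatically when data is deleted inside the VM. There are several ways to safely reclaim space: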

  • Delete the entire warrior application and re-import it.
  • Use the VirtualBox CLI to compact the disk. First, shut down the VM. Then, open a terminal and navigate to the folder where the hard-disk VDI file is stored. Finally, run VBoxManage modifymedium --compact archiveteam-warrior-v3.2-20210306-disk001.vdi, replacing archiveteam-warrior-v3.2-20210306-disk001.vdi with the name of the VDI file in use by the Warrior VM. See the VirtualBox documentation for more details and additional steps to help achieve a better result.
  • Use the zerofree program and then clone the disk image. Reattach the cloned disk image.

This issue should not affect Docker containers.

My Warrior crashed, or I had to hard-stop it, and I don't think there's time to retry the tasks.

Please notify us in the project IRC channel, including for stops due to system failures and power outages as well as hard-kills. Do not attempt to start or restart the affected containers. If time is indeed a concern, we can help you save or recover partial data from your Warrior.

(The same applies if your Warrior is still running, but tasks are now stuck—especially if they're stuck because the project has reached its deadline and the target site is now gone. Most projects should handle this gracefully, but contact us if they do not.)

The item I'm working on is downloading thousands of URLs and it's taking hours.

Please notify us in the appropriate IRC channel. You may need to reboot the Warrior.

Why is the default project not working? / Why is a manual project not in the Warrior yet?

Sorry. Sometimes the administrators are too busy...

Why are there no projects?

We finished the ones we were working on! If there are no projects showing, you can help us write one. No projects does not mean there is nothing left to archive!

The instructions to run the software/scripts are awful and they are difficult to set up.

Well, excuuuuse me, princess!

We're not a professional support team so help us help you help us all. See above for bug reports, suggestions, or code contributions.

Where can I file a bug, suggestion, or a feature request?

If the issue is related to the warrior's web interface or the library that grab scripts are using, see seesaw-kit issues. Other issues should be filed in their respective repositories.

Projects

See Warrior projects.

Are you a coder?

Like the Warrior? Interested in how it works under the hood? Got software skills? Help us improve it!