Difference between revisions of "ArchiveTeam Warrior"

From Archiveteam
(Major update to FAQ, add Docker information to FAQ)
(Make Docker command explanations collapsed by default, add explanation of Watchtower image address)

Revision as of 20:54, 6 April 2021

Version 3.2 of the Warrior VM Appliance has been released! This update enables running newer projects, shortens startup times, enables viewing basic logs from the virtual machine console (press ALT+F2 for Warrior logs, ALT+F3 for automatic updater logs, and ALT+F1 to return to the splash screen), and has other minor improvements. Warrior versions 3.0 and 3.1 will automatically update themselves with the project compatibility improvements, but the other improvements require re-creating the VM with version 3.2 of the appliance.

The Warrior should now be compatible with most projects; however, some projects may still be incompatible and show a blank screen when you attempt to run them.

In place of the Warrior, you may manually run projects using Docker. (If you like, you can run Docker in a VM of your own choosing. Recent Ubuntu versions are known to work.) For further info, see our guide to Running Archive Team Projects with Docker and also see the project's Readme instructions in the ArchiveTeam GitHub repositories. If you have any issues or feedback, see the AT #warrior IRC channel on hackint.

What is the Archive Team Warrior?


The Archive Team Warrior is a virtual archiving appliance. You can run it to help with the Archive Team archiving efforts. It will download sites and upload them to our archive — and it’s really easy to do!

The warrior is a virtual machine, so there is no risk to your computer. The warrior will only use your bandwidth and some of your disk space. It will get tasks from and report progress to the Tracker.

Basic usage

The Warrior runs on Windows, macOS, and Linux. You can run it using a virtual machine (simplest) or using Docker (far less overhead than a VM, a little more complicated to set up but still fairly simple).

Using a Virtual Machine

You'll need:

Plus one of the virtualization applications below to run it:

Quick start instructions for VirtualBox

A video demonstrating these steps is available. (Note that the screen indicating that the Warrior has finished loading looks different than the one from when this video was made, but the steps are otherwise the same.)

  1. Download the appliance from the link above.
  2. Launch VirtualBox
  3. In VirtualBox, click File > Import Appliance and open the file.
  4. Start the virtual machine.
    • It will fetch the latest updates and will eventually tell you to start your web browser.
  5. Using your regular web browser, visit http://localhost:8001/
  6. On the left, click "Your settings".
  7. Choose a username - we'll show your progress on the leaderboard.
  8. On the left, click "Available projects" tab and pick a project to work on.
    • Even better: select "ArchiveTeam's Choice" to let your warrior work on the most urgent project.

Start instructions for VMWare Player

Note that VMWare Player may have some compatibility issues with running the Warrior image.

  1. Download the appliance from the link above
  2. Launch VMWare Player
  3. In Player on the right, click "Open Virtual Machine", open the file and import the virtual machine.
  4. (Optional) Select the virtual machine and click "Edit virtual machine settings".
    • Select Network Adapter and set it to "Bridged: Connected directly to the physical network"
  5. Start the virtual machine.
    • It will fetch the latest updates and will eventually tell you to start your web browser.
  6. Using your regular web browser, visit the address that is shown on the bottom (e.g. http://192.168.0.100:8001/)
  7. On the left, click "Your settings".
  8. Choose a username - we'll show your progress on the leaderboard.
  9. On the left, click "Available projects" tab and pick a project to work on.
    • Even better: select "ArchiveTeam's Choice" to let your warrior work on the most urgent project.

Using Docker

Docker runs on Windows, macOS, and Linux, and is a free download. Docker runs code in containers, and stores code in images. (Docker requires the professional version of Windows if being run on versions of Windows prior to Windows 10 version 1903.)

Instructions for running the Warrior using Docker are below. For additional FAQ and instructions for running individual project scripts using Docker, see Running Archive Team Projects with Docker.

Instructions for using Docker CLI on Windows, macOS, or Linux

  1. Download and install Docker from the link above.
  2. Open your terminal. On Windows, you can use either Command Prompt (CMD) or PowerShell; on macOS and Linux, you can use Terminal (Bash).
  3. First, we will set up the Watchtower container. Watchtower automatically checks for updates to Docker containers every hour, and if an update is found, it will gracefully shut down your container, update it, and restart it.
    Use the following command:
    docker run -d --name watchtower --restart=unless-stopped -v /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --label-enable --cleanup --interval 3600
    Command Explanation:
    • -d: Detaches the container from the terminal and runs it in the background.
    • --name watchtower: The name that is displayed for the container. A name other than "watchtower" can be specified here if needed.
    • --restart=unless-stopped: This tells Docker to restart the container unless you stop it. This also means that it will restart the container automatically when you reboot your system.
    • -v /var/run/docker.sock:/var/run/docker.sock: This provides the Watchtower container access to your system's Docker socket. Watchtower uses this to communicate with Docker on your system to gracefully shut down and update your containers.
    • containrrr/watchtower: This is the Docker image address for Watchtower.
    • --label-enable: This tells Watchtower only to update containers that are specifically tagged for auto-updating. This is included to prevent Watchtower from updating any other containers you may have running on your system. If you are only using Docker to run Archive Team projects, or wish to automatically update all containers including those that are not for Archive Team projects, you can leave this off.
    • --cleanup: This tells Watchtower to delete old, outdated Docker images, which helps save disk space on your system.
    • --interval 3600: This tells Watchtower to check for updates to your Docker containers every hour.
  4. Now we will set up the Warrior container.
    Use the following command:
    docker run -d --name archiveteam-warrior --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped -p 8001:8001 atdr.meo.ws/archiveteam/warrior-dockerfile
    Command Explanation:
    • -d: Detaches the container from the terminal and runs it in the background.
    • --name archiveteam-warrior: The name that is displayed for the container. A name other than "archiveteam-warrior" can be specified here if needed (e.g. you want to create multiple containers using the same image).
    • --label=com.centurylinklabs.watchtower.enable=true: Labels the container to be automatically updated by Watchtower. You can leave this off if you did not include --label-enable when launching the Watchtower container.
    • --restart=unless-stopped: This tells Docker to restart the container unless you stop it. This also means that it will restart the container automatically when you reboot your system.
    • -p 8001:8001: This tells Docker to make port 8001 from the container available on your system at http://localhost:8001. This allows you to use your web browser to monitor and configure your Warrior.
    • atdr.meo.ws/archiveteam/warrior-dockerfile: This is the Docker image address for the Warrior.
  5. The Warrior will download and start up. It will automatically restart when your system is restarted unless you stop the container.
  6. When the command finishes, use your regular web browser to visit http://localhost:8001/
  7. On the left, click "Your settings".
  8. Choose a username - we'll show your progress on the leaderboard.
  9. On the left, click "Available projects" tab and pick a project to work on.
    • Even better: select "ArchiveTeam's Choice" to let your warrior work on the most urgent project.
On Windows and macOS, once you have completed steps 1-4, you can also start, stop, and delete containers in the Docker Desktop UI. However, for the time being, initial setup can only be done from the command line. Docker on Linux (either in a VM or on bare metal hardware) is the recommended way to run Docker containers.
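
After the setup steps, it can help to confirm from the terminal that both containers are actually running. The following is a minimal sketch, not part of the official instructions: the helper names container_status and container_logs are made up for illustration, and the DOCKER variable is an assumption that lets you preview the commands (for example with DOCKER=echo) instead of running them.

```shell
# DOCKER may be overridden (e.g. DOCKER=echo) to preview commands without running them.
DOCKER="${DOCKER:-docker}"

# List the name and status of containers whose name matches the argument,
# including stopped ones (--all).
container_status() {
    "$DOCKER" ps --all --filter "name=$1" --format '{{.Names}}: {{.Status}}'
}

# Show the most recent log lines of a container (20 lines unless a count is given).
container_logs() {
    "$DOCKER" logs --tail "${2:-20}" "$1"
}
```

For example, container_status archiveteam-warrior lists the Warrior container and its status, and container_logs watchtower 50 shows the last 50 lines of Watchtower's log.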

Stopping Docker containers

  1. Recommended method: Attempt graceful stop by sending the SIGINT signal, with no hard-kill deadline:
    docker kill --signal=SIGINT archiveteam-warrior
    Command Explanation:
    • kill: Docker's command for killing a container, defaults to sending a SIGKILL signal unless otherwise specified
    • --signal=SIGINT: tells Docker to send a SIGINT signal to the container (not a SIGKILL)
    • archiveteam-warrior: This is the name of the Docker container(s) that need to be stopped. If needed, replace with the actual container name(s) you want to stop. Multiple containers can be stopped with the same command.
  2. Alternate, unrecommended method: Attempt stop, with a hard-kill deadline of 1 hour:
    docker stop -t 3600 archiveteam-warrior
    Command Explanation:
    • -t 3600: tells Docker to wait for 3600 seconds (60 minutes) before forcibly stopping the container. Docker's default is -t 10 (not recommended). Use -t 0 to stop immediately (also not recommended). Hard-kill deadlines are problematic because large multi-GB projects may require long-running jobs (e.g. 48 hours for content download, plus additional hours of rsync upload time that may itself be delayed by upload bandwidth limits and/or congestion on the rsync target). Please ask in the project IRC channel if you are considering using a hard-kill method, especially for projects where there may not be time for another worker to retry later. (There may be interest in recovering/saving partial WARCs from containers that did not end gracefully.) Also see the FAQ entry about ungraceful stops.
    • archiveteam-warrior: This is the name of the Docker container(s) that need to be stopped. If needed, replace with the actual container name(s) you want to stop. Multiple containers can be stopped with the same command.

The same commands can also be used to stop the watchtower container.
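
Because the recommended method has no deadline, the command returns immediately while the container may keep working for hours. One possible way to block until the container has really exited is to follow the SIGINT with docker wait. This is only a sketch: the function name graceful_stop and the overridable DOCKER variable are illustrative assumptions, not something the Warrior provides.

```shell
# DOCKER may be overridden (e.g. DOCKER=echo) to preview commands without running them.
DOCKER="${DOCKER:-docker}"

# For each named container: request a graceful stop with SIGINT,
# then block until the container has exited (docker wait prints its exit code).
graceful_stop() {
    for name in "$@"; do
        "$DOCKER" kill --signal=SIGINT "$name"
        "$DOCKER" wait "$name"
    done
}
```

Calling graceful_stop archiveteam-warrior returns only once the Warrior has finished its current items and exited, which makes it safer to chain with a reboot or shutdown command.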

Starting Docker containers

Similarly, to start your containers again in the future, run docker start watchtower archiveteam-warrior. If needed, replace "watchtower" and "archiveteam-warrior" with the actual container names you used.

Deleting Docker containers

To delete a container, run docker rm archiveteam-warrior. If needed, replace "archiveteam-warrior" with the name of the actual container you want to delete. To free up disk space, you can also purge your unused Docker images by running docker image prune. Note that this command will delete all Docker images on your system that are not associated with a container, not just Archive Team ones.


Warrior FAQ

For additional FAQ for running the Warrior using Docker, see Running Archive Team Projects with Docker.

Why a virtual machine/Docker container in the first place?

The virtual machine is a quick, safe, and easy way for newcomers to help us out. It offers many features:

  • Graphical interface (Virtual Machine only)
  • Automatically selects which project is important to run
  • Self-updating software infrastructure
  • Allows for unattended use
  • In case of software faults, your machine is not ruined
  • Restarts itself in case of runaway programs
  • Runs on Windows, Mac, and Linux painlessly
  • Ensures consistency in the archived data regardless of your machine's quirks
  • Restarts automatically after a system restart (Docker container only)

If you have suggestions for improving this system, please talk to us as described below.

Can I use whatever internet access for the Warrior?

No. We need "clean" connections. Please ensure the following:

  • No OpenDNS. No ISP DNS that redirects to a search page. Use non-captive DNS servers.
  • No ISP connections that inject advertisements into web pages.
  • No proxies. Proxies can return bad data. The original HTTP headers and IP address are needed for the WARC file.
  • No content-filtering firewalls.
  • No censorship. If you believe your country implements censorship, do not run a warrior.
  • No Tor. The server may return an error page instead of content if they ban exit nodes.
  • No free cafe wifi. Archiving your cafe's wifi service agreement repeatedly is not helpful.
  • No VPNs. Data integrity is a very high priority for the Archive Team so use of VPNs with the official crawler is discouraged.
  • We prefer connections from many public IP addresses if possible. (For example, if your apartment building uses a single IP address, we don't want your apartment banned.)

I turned my Warrior VM/Docker appliance off. Will those tasks be lost?

If you've killed your Warrior VM/Docker instance, then the work your Warrior did has been lost. However, the tasks will be returned to the pool after a period of time, and other warriors may claim them.

I closed my browser or tab with the Warrior's web interface. Will those tasks be lost?

No. The web browser interface just provides a user interface to the Warrior. As long as the VM or Docker container is not stopped, it will continue normally.

I need to disconnect my internet / reboot my PC. How can I do this without losing work?

If you pause/suspend the Warrior VM instance, most projects will allow resuming of work in progress when you unsuspend the warrior VM instance.

If you decide to use the suspend VM feature, please note that if you keep it suspended for too long (more than a few hours), the administrators will assume that the item is lost and re-queue it. Using the suspend feature so that you can reboot your computer is perfectly fine.

Docker does not have a feature for suspending containers. If you want to disconnect your internet or reboot your PC without losing work, try to gracefully stop the container by using the recommended method for stopping Docker containers, and then start the container again when you are ready to resume work.

How much disk space will the warrior use?

Short answer: it depends on the project. The virtual machine has a hard limit of 60GB disk usage, but the Docker container does not have such a limit. However, it is highly unlikely that any project would use more than 60GB of disk space at any time.

Long answer: because each project defines items differently, sizes may vary. A single task may be a small file or a whole subsection of a website. The virtual machine is configured by default to use an absolute maximum of 60GB, but Docker has no hard limit. Any unused virtual machine/Docker container disk space is not used on the host computer. You may run the virtual machine on less than 60GB if you like to live dangerously. We're downloading the internet, after all!
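
Since the Docker container has no hard limit, you may want to check how much space it is using from time to time. A minimal sketch using standard Docker commands follows; the function name disk_report and the overridable DOCKER variable are assumptions made for illustration.

```shell
# DOCKER may be overridden (e.g. DOCKER=echo) to preview commands without running them.
DOCKER="${DOCKER:-docker}"

# Print overall Docker disk usage (images, containers, volumes),
# then the size of the containers whose name matches the argument.
disk_report() {
    "$DOCKER" system df
    "$DOCKER" ps --size --filter "name=$1"
}
```

For example, disk_report archiveteam-warrior shows the overall totals followed by the size of the Warrior container.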

How can I log into the virtual machine/Docker container?

Unless you know what you are doing, you should not need to do this.

To log into the virtual machine, start up the Warrior VM and wait for it to finish booting with the screen showing "The warrior has successfully started up". Press ALT+F4 to switch to virtual console number 4. VirtualBox users may need to press the host key (RIGHT_CONTROL) to enter capture mode before pressing ALT+F4. Use ALT+Left or ALT+Right to switch between virtual consoles. There are 6 virtual consoles in total. Consoles 1, 2, and 3 are reserved for the warrior. Switching to a new virtual console will show a login shell. You can log in using the username root and the password archiveteam.

To bring up a command shell inside the Docker container, open your terminal and run sudo docker exec -t -i archiveteam-warrior /bin/bash. Replace 'archiveteam-warrior' with the name of your Warrior container if necessary.

How can I run multiple virtual machines/Docker containers at the same time?

If you want to run multiple instances of a project within the same Warrior VM/Docker container, you can adjust the number of concurrent items to work on in your Warrior settings. The minimum concurrency value is 1, the default is 3, the maximum recommended value is 5, and the maximum allowed value is 20.

If you still want to run multiple virtual machines, you'll need to adjust the networking settings.

In VirtualBox, select a virtual machine and open up Settings → Network → Adapter 1 → Port Forwarding. You need to adjust the host port. For example, set your table to TCP | 127.0.0.1 | 8123 | | 8001. This maps port 8123 on the host machine (your computer) to port 8001 on the virtual machine (the warrior), and you can then access the warrior's web interface from port 8123 in your browser.

Each VM you want to access should have a different host port. Do not use port numbers below 1024 unless you know what you are doing.
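
The same port forwarding can also be set from the command line instead of the Settings dialog. The sketch below only prints the command so that you can review it first. It assumes the VM is powered off, that adapter 1 uses NAT, and that a new rule is added rather than the existing one edited; the rule name warrior-web and the helper name natpf_command are arbitrary labels chosen for this example.

```shell
# Print the vboxmanage command that forwards a host port (bound to 127.0.0.1)
# to port 8001 on the VM. Arguments: VM name, host port.
natpf_command() {
    vm_name="$1"
    host_port="$2"
    echo "vboxmanage modifyvm \"$vm_name\" --natpf1 \"warrior-web,tcp,127.0.0.1,$host_port,,8001\""
}

# Example: expose a VM named archiveteam-warrior-3.2 on host port 8123.
natpf_command archiveteam-warrior-3.2 8123
```

The rule fields are name, protocol, host IP, host port, guest IP (left empty), and guest port, mirroring the columns of the port forwarding table described above.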

VMWare installations should be using bridged networking. However, if you want, you can switch to NAT (under Settings → Hardware → Virtual Network Adapter) and click Edit to set up port forwarding. On Linux, you can also use lines like 8123 = 192.168.0.100:8001 in the [incomingtcp] section of nat.conf. (Make sure the VM IP is correct!)

If you want to run multiple Docker containers, you'll need to adjust the run command used to create them. First, each container needs a unique name, so you will need to replace the name specified with the --name parameter with something unique. Second, you will need to specify a unique port to access the web interface of each container. You can do this by changing the number before the : in the -p parameter to any available unique port number equal to or greater than 1024. (Additional options for specifying ports are explained in the Docker documentation.) Third, you may also want to reuse your configuration between different Docker containers. You can do this by specifying the same environment variables or bindmounting the same config.json file across all of your containers. See the Warrior Dockerfile README for more details about this.
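
As a rough illustration of the first two points, the loop below prints one run command per container, each with a unique name and host port. The naming scheme (archiveteam-warrior-1, archiveteam-warrior-2, ...) and the port numbering (8001, 8002, ...) are assumptions chosen for this example; the function only prints the commands and does not start anything.

```shell
# Print (not run) one `docker run` command per Warrior container.
# Container i is named archiveteam-warrior-i and is reachable on host port 8000+i.
warrior_run_commands() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "docker run -d --name archiveteam-warrior-$i" \
            "--label=com.centurylinklabs.watchtower.enable=true" \
            "--restart=unless-stopped -p $((8000 + i)):8001" \
            "atdr.meo.ws/archiveteam/warrior-dockerfile"
        i=$((i + 1))
    done
}

# Example: print the commands for three containers.
warrior_run_commands 3
```

Once you have reviewed the printed commands, you can paste them into your terminal one at a time. Each web interface is then available at http://localhost:8001/, http://localhost:8002/, and so on.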

How can I run the virtual machine/Docker container headlessly (without leaving a window open)?

From the VirtualBox GUI, after opening the VM, click Machine > Detach GUI. You can then close the VirtualBox Manager window.

For the VirtualBox CLI, use this command:

vboxmanage startvm archiveteam-warrior-3.2 --type headless

Shut down the VM with:

vboxmanage controlvm archiveteam-warrior-3.2 acpipowerbutton

Substituting suspend or resume for acpipowerbutton suspends or resumes the VM. For more information, consult the VirtualBox manual (Chapter 8, Sections 12 and 13).

For the VMWare CLI, use this command:

vmrun start <path to vmx file> nogui

Shut down with:

vmrun stop <path to vmx file> soft

Substituting suspend for stop suspends the VM. Resume with start again. For more information, including the paths to VMX files on different operating systems, consult Using vmrun to Control Virtual Machines (PDF), pages 10 and 11.

The Docker container runs headlessly by default with no need for additional configuration.

How can I set up the virtual machine/Docker container as a system service (so that it starts up on boot and shuts down automatically)?

If you are using VirtualBox and running a Linux distribution that uses the systemd init system (like most recent releases), you can follow the short instructions on this page. (The page title specifies Arch Linux, but this will work for other distros as long as they run systemd.)

The Docker container starts on boot and shuts down automatically by default, with no need for additional configuration.

How can I set up the virtual machine with directly-bridged networking instead of NAT?

On VirtualBox, use these commands:

vboxmanage modifyvm archiveteam-warrior-2 --nic1 bridged
vboxmanage modifyvm archiveteam-warrior-2 --bridgeadapter1 eth0

We presume you want to bind to eth0. Adjust as required. :)

VMWare installations should already be using bridged networking.

How can I access the virtual machine from another device on my network?

Full guide for VirtualBox users is found here

How can I run tons of warriors easily?

We assume you've checked with the current Archive Team project what concurrency and resources are needed or useful!

Whether you have your own virtual cluster or you're renting someone else's (aka a "cloud"), you probably need some orchestration software.

Archive Team volunteers have successfully used a variety of hosting providers and tools (including free trials on AWS and GCE), often just by building their own flavor of virtual server and then repeating it with simple cloud-init scripts (to install and launch Docker as above) or whatever tool the hosting provider offers. If you desire full automation, the archiveteam-infra repository by diggan helps with Terraform on DigitalOcean.

Some custom monitoring scripts also exist, for instance watcher.py.

You can also review instructions for running multiple Warrior VMs/Docker containers on one machine which may also be helpful here.
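
As a rough sketch of the "repeat it with a simple script" approach, the function below prints one docker run command per container, giving each its own name and host port for the web interface. Nothing is executed. The image address, the archiveteam-N naming scheme, and the port numbering are assumptions for illustration; use the values from the Docker setup instructions. The Watchtower label is included on the assumption that Watchtower was started with --label-enable.

```shell
#!/usr/bin/env bash
# Print one `docker run` command per Warrior container, each with its own
# name and host port. Nothing is executed here; review the output first.
# Assumption: IMAGE is the Warrior image address from the Docker setup
# instructions. The default below may not match your setup.
IMAGE="${IMAGE:-atdr.meo.ws/archiveteam/warrior-dockerfile}"

print_warrior_commands() {
    local count="$1" base_port="${2:-8001}"
    local i=0
    while [ "$i" -lt "$count" ]; do
        # Host port base_port+i is mapped to the interface port 8001 in the container.
        printf 'docker run -d --name archiveteam-%d --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped -p %d:8001 %s\n' \
            "$i" "$((base_port + i))" "$IMAGE"
        i=$((i + 1))
    done
}

print_warrior_commands 3
```

Once the printed commands look right, they can be piped to sh to run them. Check the project's recommended concurrency before scaling up.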

I'm looking at the leaderboard. What's that icon beside the username?

That's just the warrior logo. It means that that person is using the warrior. Those without the icon are running the scripts manually.

What's that guy doing in the logo?

The place is on fire! But don't worry, he safely escaped with the rescued data in his arms.

That’s awesome – can I slap this logo on my laptop to show my Internet-preservation pride?

You sure can! The ArchiveTeam Warrior laptop sticker can start conversations about archiving, if you’re into that.

I'd like to help write code or I want to tweak the scripts to run to my liking. Where can I find more info? Where is the source code and repository?

In order to ensure data accuracy, it is imperative that users contributing to Archive Team projects do not modify the project scripts. If you would like to propose improvements for future official versions of the project scripts, or would like to use our code for non-Archive Team projects, check out the Dev documentation for details on the infrastructure and the source code layout.

I still have a question!

Check out the general FAQ page. Talk to us on IRC. Use #warrior for specific warrior questions or #archiveteam-bs for general questions.

Troubleshooting

I'm getting errors when I try to launch the VM.

If you are receiving Breakpoint has been reached (0x80000003), A critical error has occurred while running the virtual machine and the machine execution has been stopped., or VT-X errors, you probably do not have virtualization enabled, either because it is turned off in your computer's BIOS or your CPU does not support it.

You can check CPU support on Linux with grep -E "vmx|svm" /proc/cpuinfo | uniq. (The -E option is required; without it, grep treats the | as a literal character and prints nothing.) If there is a line of output starting with "flags", your processor supports virtualization; if there is no output, it does not. You can check whether virtualization is enabled in the BIOS using the rdmsr utility in your distro's msr-tools package.
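
The same check can be wrapped in a small function that answers yes or no instead of printing the full flags line. This is only a sketch: the function name and the optional path argument are made up for illustration, and it says nothing about whether virtualization is enabled in the BIOS.

```shell
#!/usr/bin/env bash
# Report whether a cpuinfo-style file advertises hardware virtualization.
# vmx is the Intel VT-x flag, svm is the AMD-V flag.
# The path argument exists for illustration; it defaults to /proc/cpuinfo.
cpu_supports_virtualization() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    # -E: extended regex so | means "or"; -q: quiet; -w: match whole words only.
    grep -Eqw 'vmx|svm' "$cpuinfo"
}

if cpu_supports_virtualization; then
    echo "CPU reports virtualization support"
else
    echo "No vmx/svm flag found"
fi
```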

You can check support and BIOS status on Windows using Microsoft's Hardware-Assisted Virtualization Detection Tool or VirtualChecker.

To enable virtualization on a CPU with support, reboot the computer and enter the BIOS. The virtualization setting is usually under something like 'CPU configuration' or 'advanced settings'.

I can't connect to localhost.

The application is configured to set up port forwarding to the guest machine, and you should be able to access the interface through your web browser at port 8001. If this does not happen, and isn't resolved by rebooting the warrior (using the ACPI power signals, not suspend/save state and resume), you may need to double-check your machine's network settings (as described above).

The warrior can't connect to the internet.

It's possible that the virtual machine has picked up the address of the local DNS cache on your computer, which the virtual machine does not have access to.

If you experience this on VirtualBox, see this question and answer. Additionally, check to see if "Cable Connected" is unchecked in the advanced settings of the virtual adapter, under the network tab in the virtual machine's settings. Check it if it's unchecked, then save your settings.

I see a message that no item was received.

This means that there is no work available. This can happen for several reasons:

  • The project has just finished and someone is inspecting the work done. If a problem is discovered, items may be re-queued and more work will become available.
  • You have checked out/claimed too many items. Reduce your concurrency and let others do some of the work too.
  • In a rare case, you have been banned by a tracker administrator because there was a problem with your work: you were requesting too much, you were tampering with the scripts, a malfunction has occurred, or your internet connection is "unclean" (see above).

I see a message about rate limiting.

Don't worry. Keep in mind that although downloading the internet for fun and digital preservation are the primary goals of all Archive Team activities, serious stress on the target's server may occur. The rate limit is imposed by a tracker administrator and should not be subverted.

(In other words, we don't want to DDoS the servers.)

If you like, you can switch to another project with less load.

I see a message about code being out of date.

Don't worry. There is a new update ready. You do not need to do anything about this; the Warrior will update its code every hour. If you are impatient, please reboot the warrior and it will download the latest code and resume work.

I'm running the scripts manually and I see a message about code being out of date.

This happens when a bug in the scripts is discovered. Bugs are unavoidable, especially when the server is out of our control.

If you are running the scripts using Docker, we recommend using Watchtower to check for updates every hour, downloading and installing them when necessary. See the setup instructions in Running Archive Team Projects with Docker for more details.

If you are not running the scripts using the provided Docker images, try the --auto-update option available in Seesaw version 0.8. However, please be aware that you are now executing code automatically. Be sure to run the scripts in a separate user account for safety.

I see messages about rsync errors.

Uh-oh! Something is not right. Please notify us immediately in the appropriate IRC channel.

I told the warrior to shut down from the interface, but nothing has changed.

The warrior will attempt to finish the current running tasks before shutting down. If you need to shut down right away, go ahead. Your progress will be lost, but the jobs will eventually cycle out to another user.

The warrior is eating all my bandwidth!

On VirtualBox (relatively recent versions), use this command:

vboxmanage bandwidthctl archiveteam-warrior-3 add limit --type network --limit 3m

This will limit the warrior to 3Mb/s. (Limit units are k for kilobit, m for megabit, g for gigabit, K for kilobyte, M for megabyte, and G for gigabyte.) Adjust as required. :)

In the latest version of VirtualBox on Windows, the syntax appears to have changed. The correct command now seems to be:

VBoxManage bandwidthctl archiveteam-warrior-3 add netlimit --type network --limit 3

For more information, consult the VirtualBox manual (Chapter 6, Section 9).

On VMware (versions 9 and above), select a virtual machine and open Settings → Hardware → Virtual Network Adapter → Advanced. You can set a bandwidth limit here.

Docker has no feature for limiting bandwidth.

The Warrior virtual machine is using up disk space, even though it's not running a project!

Virtual machine disk images do not behave like a regular file. There are several ways to safely reclaim space:

  • Delete the entire warrior application and re-import it.
  • Use the VirtualBox CLI to compact the disk. First, shut down the VM. Then, open a terminal and navigate to the folder where the hard-disk VDI file is stored. Finally, run VBoxManage modifymedium --compact archiveteam-warrior-v3.2-20210306-disk001.vdi, replacing archiveteam-warrior-v3.2-20210306-disk001.vdi with the name of the VDI file in use by the Warrior VM. See the VirtualBox documentation for more details and additional steps to help achieve a better result.
  • Use the zerofree program and then clone the disk image. Reattach the cloned disk image.
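
If you are not sure which VDI file the Warrior VM uses, one possible approach is to read it from the output of VBoxManage showvminfo with the --machinereadable option. The filter below is a sketch that assumes disk attachments appear as quoted "controller-port-device"="/path/file.vdi" lines; the controller name and path in the canned example are made up so that the snippet runs without VirtualBox installed.

```shell
#!/usr/bin/env bash
# Read `VBoxManage showvminfo <vm> --machinereadable` output on stdin and
# print the path of every attached .vdi disk image, one per line.
# Assumption: attachments look like "Controller-port-device"="/path/file.vdi".
list_vdi_paths() {
    sed -n 's/^"[^"]*"="\(.*\.vdi\)"$/\1/p'
}

# Example with canned output (hypothetical controller name and path).
list_vdi_paths <<'EOF'
name="archiveteam-warrior-3.2"
"SATA-0-0"="/home/user/VirtualBox VMs/warrior/archiveteam-warrior-v3.2-20210306-disk001.vdi"
"SATA-ImageUUID-0-0"="00000000-0000-0000-0000-000000000000"
EOF
```

With a real VM, pipe the showvminfo output for your VM name into list_vdi_paths, then pass the printed path to VBoxManage modifymedium --compact after shutting the VM down.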

This issue should not affect Docker containers.

Recovering from an ungraceful virtual machine/Docker container stop

Please ask in the project IRC channel if some of your VMs or containers were stopped ungracefully. This includes container stops that ended in a hard kill, as well as stops caused by system failures or power outages. This is especially important for projects where there may not be enough time for another worker to retry later. Do not attempt to start/restart the affected containers. (Note: it is possible to recover/save partial WARCs using docker cp archiveteam:/grab/ ./ or similar from still running containers that are about to be terminated.)
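
If several containers are about to be terminated, the docker cp step can be repeated per container. The sketch below copies /grab/ from each named container into its own timestamped directory, and with DRY_RUN=1 it only prints the commands. The function name and the warrior-rescue directory layout are assumptions for illustration, and it only applies to containers that are still running.

```shell
#!/usr/bin/env bash
# Copy /grab/ out of each named, still-running container into its own
# timestamped directory. Set DRY_RUN=1 to print the commands instead.
# Assumptions: work lives under /grab/ (as in the docker cp example above);
# the ./warrior-rescue/ layout is made up for illustration.
rescue_grab_dirs() {
    local stamp dest name
    stamp="$(date +%Y%m%d-%H%M%S)"
    for name in "$@"; do
        dest="./warrior-rescue/${name}-${stamp}"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "docker cp ${name}:/grab/ ${dest}/"
        else
            # Create the target first so docker cp places grab/ inside it.
            mkdir -p "$dest" && docker cp "${name}:/grab/" "${dest}/"
        fi
    done
}

# Dry run for a container named "archiveteam"; drop DRY_RUN=1 to copy for real.
DRY_RUN=1 rescue_grab_dirs archiveteam
```

Keep the copied directories untouched and mention them when you ask in the project IRC channel.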

The item I'm working on is downloading thousands of URLs and it's taking hours.

Please notify us in the appropriate IRC channel. You may need to reboot the Warrior.

Why is the default project not working? / Why is a manual project not in the Warrior yet?

Sorry. Sometimes the administrators are too busy...

Why are there no projects?

We finished the ones we were working on! If there are no projects showing, you can help us write one. No projects does not mean there is nothing left to archive!

The instructions to run the software/scripts are awful and they are difficult to set up.

Well, excuuuuse me, princess!

We're not a professional support team so help us help you help us all. See above for bug reports, suggestions, or code contributions.

Where can I file a bug, suggestion, or a feature request?

If the issue is related to the warrior's web interface or the library that grab scripts are using, see seesaw-kit issues. Other issues should be filed into their own repositories.

Projects

See Warrior projects.

Are you a coder?

Like the Warrior? Interested in how it works under the hood? Got software skills? Help us improve it!