Running Archive Team Projects with Docker

{{notice|1=This page is currently in draft form and is being worked on. Instructions may be incomplete.}}
You can run Archive Team scripts in Docker containers to help with our archiving efforts. It will download sites and upload them to our archive — and it’s really easy to do!


The scripts run in a Docker container, so there is almost no security risk to your computer. "Almost", because in practice nothing is 100% secure. The container will mainly use some of your bandwidth and disk space, as well as some of your CPU and memory. It will get tasks from and report progress to the [[Tracker]].
 
__TOC__


== Basic usage ==


Docker runs on Windows, macOS, and Linux, and is a [https://docs.docker.com/get-docker/ free download]. Docker runs code in '''containers''', and stores code in '''images'''. (On Windows releases prior to Windows 10 version 1903, Docker requires the Professional edition.)


=== Instructions for using Docker CLI on Windows, macOS, or Linux ===
# Download and install Docker from the link above.
# Open your terminal. On Windows, you can use either Command Prompt (CMD) or PowerShell; on macOS and Linux, you can use Terminal (Bash).
# First, we will set up the [https://containrrr.dev/watchtower/ Watchtower] container. Watchtower automatically checks for updates to Docker containers every hour, and if an update is found, it will gracefully shut down your container, update it, and restart it.<br />Use the following command:<pre>docker run -d --name watchtower --restart=unless-stopped -v /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --label-enable --cleanup --include-restarting --interval 3600</pre>Explanation:
#* <code>-d</code>: Detaches the container from the terminal and runs it in the background.
#* <code>--name watchtower</code>: The name that is displayed for the container. A name other than "watchtower" can be specified here if needed.
#* <code>--restart=unless-stopped</code>: This tells Docker to restart the container unless you stop it. This also means that it will restart the container automatically when you reboot your system.
#* <code>-v /var/run/docker.sock:/var/run/docker.sock</code>: This provides the Watchtower container access to your system's Docker socket. Watchtower uses this to communicate with Docker on your system to gracefully shut down and update your containers.
#* <code>containrrr/watchtower</code>: This is the Docker image address for Watchtower.
#* <code>--label-enable</code>: This tells Watchtower only to update containers that are specifically tagged for auto-updating. This is included to prevent Watchtower from updating any other containers you may have running on your system. If you are only using Docker to run Archive Team projects, or wish to automatically update all containers including those that are not for Archive Team projects, you can leave this off.
#* <code>--cleanup</code>: This tells Watchtower to delete old, outdated Docker images, which helps save disk space on your system.
#* <code>--include-restarting</code>: This tells Watchtower to include containers that are in the 'restarting' state. This is included to update a project container if it's caught in a crash-loop, as it wouldn't otherwise be updated.
#* <code>--interval 3600</code>: This tells Watchtower to check for updates to your Docker containers every hour.
# Now we will set up a project container. You'll need to know the image address for the script for the project you want to help out with. If you don't know it, you can ask us on [[Archiveteam:IRC|IRC]].<br />Use the following command:<pre>docker run -d --name archiveteam --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped [image address] --concurrent 1 [username]</pre> For example, to assist with the [[Reddit]] project ({{IRC|shreddit}}):<pre>docker run -d --name archiveteam --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped atdr.meo.ws/archiveteam/reddit-grab --concurrent 1 [username]</pre>Explanation:
#* <code>-d</code>: Detaches the container from the terminal and runs it in the background.
#* <code>--name archiveteam</code>: The name that is displayed for the container. A name other than "archiveteam" can be specified here if needed (e.g. if you want to create multiple containers from the same image).
#* <code>--label=com.centurylinklabs.watchtower.enable=true</code>: Labels the container to be automatically updated by Watchtower. You can leave this off if you did not include <code>--label-enable</code> when launching the Watchtower container.
#* <code>--restart=unless-stopped</code>: This tells Docker to restart the container unless you stop it. This also means that it will restart the container automatically when you reboot your system.
#* <code>[image address]</code>: Replace this with the image address for the project you would like to help with. The brackets should not be included in the final command. Additionally, the address should not include <code>https://</code> or <code>http://</code>, and all characters must be lowercase. Most project images are made available at 'atdr.meo.ws/archiveteam/$repo-grab', where $repo is the same name as used in the code repository; e.g. the code at https://github.com/ArchiveTeam/reddit-grab corresponds to the Docker image address 'atdr.meo.ws/archiveteam/reddit-grab'.
#* <code>--concurrent 1</code>: Process 1 item at a time per container. Although this varies for each project, the maximum recommended value is 5, and the maximum allowed value is 20. Leave this at 1, or check with us on [[Archiveteam:IRC|IRC]] if you are unsure.
#* <code>[username]</code>: Choose a username - we'll show your progress on the [[tracker|project leaderboard (tracker)]]. The brackets should not be included in the final command.
{{notice|1=On Windows and macOS, once you have completed steps 1-4, you can also start, stop, and delete containers in the Docker Desktop UI. However, for the time being, initial setup and switching projects can only be done from the command line. Docker on Linux (either in a VM or on bare metal hardware) is the recommended way to run Docker containers.}}
If you prefer '''Podman''' over Docker, [[User:Sanqui]] has had success running the Warrior using
<code>podman run --detach --name at-warrior --label=io.containers.autoupdate  --restart=on-failure --publish 8001:8001  atdr.meo.ws/archiveteam/warrior-dockerfile</code> and [https://docs.podman.io/en/latest/markdown/podman-auto-update.1.html podman-auto-update] in place of Watchtower.
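
Note that <code>podman-auto-update</code> only runs when something triggers it; on systemd-based distros, Podman ships a timer unit for exactly that (a minimal sketch, assuming the standard Podman packaging):
<pre>
# run podman-auto-update on the timer's schedule (distro-defined, typically daily)
sudo systemctl enable --now podman-auto-update.timer
</pre>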
 
=== Stopping containers ===
# '''Recommended method:''' Attempt graceful stop by sending the SIGINT signal, with no hard-kill deadline:<br><code>docker kill --signal=SIGINT archiveteam</code><br>Explanation:
#* <code>kill</code>: Docker's command for killing a container, defaults to sending a SIGKILL signal unless otherwise specified
#* <code>--signal=SIGINT</code>: tells Docker to send a SIGINT signal to the container (not a SIGKILL)<br>
#* <code>archiveteam</code>: This is the name of the Docker container(s) that need to be stopped. If needed, replace with the actual container name(s) you want to stop. Multiple containers can be stopped with the same command.<br>
# '''Alternate, unrecommended method:''' Attempt stop, with a hard-kill deadline of 1 hour:<br><code>docker stop -t 3600 archiveteam</code><br>Explanation:
#* <code>-t 3600</code>: tells Docker to wait for 3600 seconds (60 minutes) before forcibly stopping the container. Docker's default is <code>-t 10</code> (not recommended). Use <code>-t 0</code> to stop immediately (also not recommended). Hard-kill deadlines are problematic because large multi-GB projects may require long-running jobs (e.g. 48 hours for content download + additional hours of rsync upload time that itself may be delayed by upload bandwidth limits and/or congestion on the rsync target). Please ask in the project [[Archiveteam:IRC|IRC]] channel if you are considering using a hard-kill method, especially for projects where there may not be time for another worker to retry later. (There may be interest in recovering/saving partial WARCs from containers that did not end gracefully.) Also see the FAQ entry about ungraceful stops.<br>
#* <code>archiveteam</code>: This is the name of the Docker container(s) that need to be stopped. If needed, replace with the actual container name(s) you want to stop. Multiple containers can be stopped with the same command.<br>
 
The same commands can also be used to stop the <code>watchtower</code> container.
 
=== Starting containers ===
Similarly, to start your containers again in the future, run <code>docker start watchtower archiveteam</code>. If needed, replace "watchtower" and "archiveteam" with the actual container names you used.


=== Deleting containers ===
To delete a container, run <code>docker rm archiveteam</code>. If needed, replace "archiveteam" with the name of the actual container you want to delete. To free up disk space, you can also purge your unused Docker images by running <code>docker image prune</code>. Note that this command will delete all Docker images on your system that are not associated with a container, not just Archive Team ones.


=== Checking for project updates ===
Remember to periodically check [[Archiveteam:IRC|our IRC]] channels and homepage so you switch your scripts to a current project. Projects change frequently at Archive Team, and at the moment we don't have a way to automatically switch the projects run in Docker containers. To switch projects, stop your existing Archive Team container with <code>docker stop archiveteam</code>, delete it with <code>docker rm archiveteam</code>, and run a new one by repeating step 4 (the full sequence is sketched below). Then you can optionally prune your unused Docker images as described under Deleting containers above. Note: you don't need to stop or replace your Watchtower container; just make sure it is still running by using <code>docker ps -f name=watchtower</code>. If Watchtower is not running or you are unsure, run <code>docker start watchtower</code>.
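
Put together, a full project switch looks like this (a sketch; <code>[new image address]</code> and <code>[username]</code> are placeholders, exactly as in step 4):
<pre>
docker stop archiveteam        # stop the old project container
docker rm archiveteam          # delete it
docker image prune             # optional: reclaim disk space from unused images
docker run -d --name archiveteam --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped [new image address] --concurrent 1 [username]
docker ps -f name=watchtower   # confirm Watchtower is still running
</pre>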


== FAQ ==
=== Why a Docker container in the first place? ===

A Docker container is a quick, safe, and easy way for newcomers to help us out. It offers many features:

* Self-updating software infrastructure provided by Watchtower
* Allows for unattended use
* In case of software faults, your machine is not ruined
* Restarts itself in case of runaway programs
* Runs on Windows, macOS, and Linux painlessly
* Ensures consistency in the archived data regardless of your machine's quirks
* Restarts automatically after a system restart


If you have suggestions for improving this system, please talk to us as described below.
=== Can I use whatever internet access for running scripts? ===
No. We need "clean" connections. Please ensure the following:


* Use a DNS server that issues correct responses. Pinging a nonexistent domain should never return any IP; it should return NXDOMAIN. As an example, before 2014 OpenDNS redirected requests for nonexistent domains to a search page with ads. This is not clean. Another example of an "unclean" DNS is [[wikipedia:CleanBrowsing|CleanBrowsing]], which aims to shield its users from fap material. The DNS should preferably not attempt to filter anything, not even phishing domains. 9.9.9.10 from [[wikipedia:Quad9|Quad9]] may be a good public DNS. 8.8.8.8 from Google should be unfiltered as well. (A quick check appears after this list.)
* No ISP connections that inject advertisements into web pages or otherwise scan/filter/change content. The practice is less common nowadays as most sites use [[wikipedia:HTTPS|SSL]] which complicates injection. Doesn't stop ''some'' parties from trying anyway.<ref>{{URL|https://www.neowin.net/news/gogo-inflight-internet-is-intentionally-issuing-fake-ssl-certificates/|Gogo Inflight Internet is intentionally issuing fake SSL certificates }}</ref>
* No proxies. Proxies can return bad data. The original HTTP headers and IP address are needed for the WARC file.
* No content-filtering firewalls.
* No major censorship. If you believe your country implements major censorship, do not run a warrior. Examples are [[wikipedia:Internet censorship in China|China]] and [[wikipedia:Censorship in Turkey#Internet censorship|Turkey]]. What content may or may not be accessible is unpredictable in these countries, and requests may return a page that says "this website is blocked", which is unhelpful to archive. "Minor" censorship is far more common: a small number of sites are blocked, the blocks are widely announced, and new blocks are not frequently implemented. For example, several countries have blocked [[wikipedia:The Pirate Bay|The Pirate Bay]], and a ruling from the European Commission requires European providers to block access to [[wikipedia:RT (TV network)|RT]] and [[wikipedia:Sputnik (news agency)|Sputnik]]. Another example of "minor" censorship is when access is blocked to sites you wouldn't want to archive in a million years, like those dedicated to hosting imagery of child abuse. While censorship is always a bad idea (and abusive sites should be shut down, not blocked), "minor" censorship ''typically'' won't (or shouldn't) affect the Warrior, as the blocks are predictable. Obviously you won't be able to contribute to archiving sites that are blocked for you. When in any doubt, ask on IRC first.
* No Tor. The server may return an error page instead of content if they ban exit nodes.
* No free cafe/public transport/store wifi. Archiving your cafe's wifi service agreement repeatedly is not helpful. In addition, you may slow down the service for the people around you.
* No VPNs. Data integrity is a very high priority for the Archive Team so use of VPNs with the official crawler is discouraged. Servers may also be more likely to deploy a rate limit or serve a [[wikipedia:CAPTCHA|CAPTCHA]] page when using a VPN which is unhelpful to archive.
* We prefer connections from many public unshared IP addresses if possible. If a single IP attempts to back up an entire site, it may result in that IP getting banned by the server. Also, if a server ''does'' ban an IP, we'd rather this ban only affects you and not everyone in your apartment building.
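
A quick way to check the DNS point above (a sketch using <code>dig</code>; the <code>.invalid</code> TLD is reserved and must never resolve, so a clean resolver has to answer NXDOMAIN):
<pre>
dig some-random-name.invalid +noall +comments | grep status
# clean resolver:       status: NXDOMAIN
# redirecting resolver: status: NOERROR, with an answer pointing at a search page
</pre>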


=== I turned my Docker container off. Will those tasks be lost? ===


If you've killed your Docker instance, then the work your container did has been lost. However, the tasks will be returned to the pool after a period of time, and others may claim them.  


<!-- === I closed my browser or tab with the warrior's web interface. Will those tasks be lost? ===


No. The web browser interface just provides a user interface to the warrior. As long as the VM or Docker container is not stopped, it will continue normally. -->
=== How much disk space will the Docker container use? ===


Short answer: it depends on the project. Ask in the project IRC channel. <!-- (But never more than 60GB.) -->


Long answer: because each project defines items differently, sizes may vary. A single task may be a small file or a whole subsection of a website. <!-- The virtual machine is configured by default to use an absolute maximum of 60GB. Any unused virtual machine disk space is not used on the host computer. You may run the virtual machine on less than 60GB if you like to live dangerously. We're downloading the internet, after all! -->


=== How can I see the status of my archiving? ===
You can check the [[tracker|project leaderboard]] to see how much you've archived. If you want to see the current status of your Docker container, you can run <code>docker logs --tail 0 -f archiveteam</code>. <code>--tail 0</code> tells Docker to only show newly added log messages, and <code>-f</code> tells Docker to keep displaying logs as they come in until you press Control-C to stop it. If needed, replace "archiveteam" with the actual name you used for your container.


=== How can I look around inside a container? ===
Run this to bring up a command shell inside the container. Replace 'archiveteam' with the name of the container:<br>
<code>sudo docker exec -t -i archiveteam /bin/bash</code>
<!--
=== How can I set up the Docker container as a system service (so that it starts up on boot and shuts down automatically)? ===


If you are using VirtualBox and running a Linux distribution that uses the systemd init system (like most recent releases), you can follow the short instructions on [http://www.ericerfanian.com/automatically-starting-virtualbox-vms-on-archlinux-using-systemd/ this page]. (The page title specifies Arch Linux, but this will work for other distros as long as they run systemd.)
 
-->
<!-- === How can I run the warrior without a virtual machine? (The VM has too much overhead for a VPS!) ===


Another alternative is '''running the project manually.''' If you are managing a VPS, it's likely you are comfortable with some Linux stuff. Consult the project wiki page or the source code repository readme file.
-->
=== Can I run the Warrior on ARM or some other unusual architecture? ===
No, currently we do not allow ARM (used on Raspberry Pi and M1 Macs) or other non-x86 architectures. This is because we have previously discovered questionable practices in the Wget archive-creating components, and we are not confident it runs correctly under different endiannesses etc. If you still want to run it, Docker can apparently emulate x86_64; a sketch follows.
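
If you do go the emulation route, something like the following may work (a sketch; <code>tonistiigi/binfmt</code> is the commonly used QEMU binfmt installer image, and expect a heavy performance penalty):
<pre>
# one-time: register QEMU handlers so the host can run amd64 binaries
docker run --privileged --rm tonistiigi/binfmt --install amd64
# then force the x86_64 platform when starting the project container
docker run -d --platform linux/amd64 --name archiveteam --restart=unless-stopped [image address] --concurrent 1 [username]
</pre>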
=== How can I run tons of containers easily? ===


We assume you've checked with the current Archive Team project what concurrency and resources are needed or useful!


Whether you have your own virtual cluster or you're renting someone else's (aka a "[https://fsfe.org/activities/nocloud/ cloud]"), you probably need some [[wikipedia:Category:Orchestration_software|orchestration software]].
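
For a handful of containers on one machine, Docker Compose may already be enough. The sketch below simply restates the <code>docker run</code> flags from the setup steps as a Compose file; <code>[image address]</code> and <code>[username]</code> are placeholders as before:
<pre>
services:
  watchtower:
    image: containrrr/watchtower
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    command: --label-enable --cleanup --include-restarting --interval 3600
  archiveteam:
    image: [image address]
    restart: unless-stopped
    labels:
      - com.centurylinklabs.watchtower.enable=true
    command: "--concurrent 1 [username]"
</pre>
Starting it with <code>docker compose up -d --scale archiveteam=5</code> runs five copies of the project container, and Watchtower keeps all of them updated.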
ArchiveTeam volunteers have successfully used a variety of hosting providers and tools (including free trials on AWS and GCE), often just by building their own flavour of virtual server and then repeating it with simple cloud-init scripts (to install and launch Docker as above) or whatever tool the hosting provides. If you desire full automation, the archiveteam-infra repository by diggan helps with Terraform on DigitalOcean.

Some custom monitoring scripts also exist, for instance watcher.py.

=== I'd like to help write code or I want to tweak the scripts to run to my liking. Where can I find more info? Where is the source code and repository? ===

Check out the [[Dev]] documentation for details on the infrastructure and details of the source code layout.

=== I still have a question! ===

Check out the general [[FAQ]] page. Talk to us on [[IRC]]. Use #archiveteam-bs for general questions or the project IRC channel for project-specific instructions.


== Troubleshooting ==
=== (Linux) Running Docker commands gives me a permission denied error. How can I fix this? ===
There are a few ways to fix this issue. The fastest way is to put <code>sudo</code> before your Docker commands. This runs the process as the root user. You can also log into your system as root and run the Docker commands from there. Alternatively, you can create a <code>docker</code> user group and add your account to it by running <code>sudo groupadd docker</code>, then <code>sudo usermod -aG docker $USER</code>, and then activate the changes by running <code>newgrp docker</code> or simply logging out and logging back in to your system or rebooting your system<ref>{{URL|https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user}}</ref>.
<!--
=== I can't connect to localhost. ===


The application is configured to set up port forwarding to the guest machine, and you should be able to access the interface through your web browser at port 8001. If this does not happen, and isn't resolved by rebooting the container (using the ACPI power signals, not suspend/save state and resume), you may need to double-check your machine's network settings (as described [[#How_can_I_run_multiple_virtual_machines_at_the_same_time.3F|above]]). -->
=== I see a message that no item was received. ===


This means that there is no work available. This can happen for several reasons:

* The project has just finished and someone is inspecting the work done. If a problem is discovered, items may be re-queued and more work will become available.
* You have checked out/claimed too many items. Reduce your concurrency and let others do some of the work too.
* In a rare case, you have been banned by a tracker administrator because there was a problem with your work: you were requesting too much, you were tampering with the scripts, a malfunction has occurred, or your internet connection is "unclean" (see above).

=== I see a message about rate limiting. ===

Don't worry. Keep in mind that although downloading the internet for fun and digital preservation are the primary goals of all Archive Team activities, serious stress on the target's server may occur. The rate limit is imposed by a tracker administrator and should not be subverted.
(In other words, we don't want to DDoS the servers.)


If you like, you can switch to another [[Current_Projects#Warrior-based_projects|project]] with less load.


=== I see a message about code being out of date. ===


Don't worry. There is a new update ready. You do not need to do anything about this if you are running the container with Watchtower; Watchtower will update its code every hour. If you are impatient, please stop and remove your container, then repeat step 4 in the setup instructions and it will download the latest code and resume work.


=== I'm running the scripts manually and I see a message about code being out of date. ===
=== I see messages about rsync errors. ===


If those messages are saying <code>max connections reached -- try again later</code>, then everything is fine and the file will be uploaded eventually.
 
If the above error persists for hours (for the same item), or if the error message says something else, then something is not right. Please notify us immediately in the appropriate [[IRC]] channel.




We're not a professional support team so help us help you help us all. See above for [[#Where_can_I_file_a_bug.2C_suggestion.2C_or_a_feature_request.3F|bug reports]], [[#Where_can_I_file_a_bug.2C_suggestion.2C_or_a_feature_request.3F|suggestions]], or [[#I.27d_like_to_help_write_code._Where_can_I_find_more_info.3F|code contributions]].
=== Recovering from an ungraceful container stop ===
Please ask in the project [[Archiveteam:IRC|IRC]] channel if some of your containers were stopped ungracefully. This includes stops that used a hard kill, as well as stops due to system failures or power outages. This is especially important for projects where there may not be enough time for another worker to retry later. Do not attempt to start/restart the affected containers. (Note: it is possible to recover/save partial WARCs using <code>docker cp archiveteam:/grab/ ./</code> or similar from still-running containers that are about to be terminated.)


=== Where can I file a bug, suggestion, or a feature request? ===
== Advanced usage ==
=== Resource constraints / CPU priority with cgroups ===
While Docker does have per-container resource limits<ref>{{URL|https://docs.docker.com/config/containers/resource_constraints|Runtime options with Memory, CPUs, and GPUs}}</ref>, using a cgroup allows you to give a group of containers shared resource constraints. This is better suited to how most people running many Archive Team projects at once want to control resource usage.
Defining a cgroup on systems that use systemd is fairly straightforward. It's done by creating a .slice file under <code>/etc/systemd/system</code>.
To give an example, let's define a cgroup that only allows processes to use CPU that would have been otherwise idle, which should mean there's no impact on other processes. Here we'll use <code>archiveteam.slice</code> and create the file at <code>/etc/systemd/system/archiveteam.slice</code>:
<pre>
[Slice]
# With the special "idle" weight processes only get cpu if there is otherwise idle capacity
# The CPUWeight can also be set to a number from 1 to 10000 for relative weighting. The default is 100. A higher weight means more CPU time, a lower weight means less.
CPUWeight=idle
# optional: maximum memory usage of this slice, prevents system oom situations if a container balloons due to changes
# When the memory limit is reached, the OOM killer will simply kill processes, so make sure this is just a last line of defense/safety limit to prevent your system from locking up
#MemoryMax=20G
</pre>
For more options run <code>man systemd.resource-control</code> or check the {{URL|https://manpages.debian.org/stable/systemd/systemd.resource-control.5.en.html|Debian online systemd.resource-control man page}}
In order to use this cgroup for a container, it needs to be specified during the <code>docker run</code> command via the <code>--cgroup-parent</code> argument, for example:
<pre>
docker run -d --cgroup-parent archiveteam.slice --name archiveteam --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped [image address] --concurrent 1 [username]
</pre>
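To verify that a container actually landed under the slice, you can check the cgroup parent Docker recorded for it, and systemd's view of the slice (a sketch; both commands only read state):
<pre>
docker inspect --format '{{.HostConfig.CgroupParent}}' archiveteam
systemd-cgls /archiveteam.slice
</pre>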
=== Enabling IPv6 ===
Some projects now support (and prefer) IPv6 when available. In Docker, IPv6 support is disabled by default.<ref>{{URL|https://docs.docker.com/config/daemon/ipv6/|Docker: Enable IPv6 support}}</ref>
To enable IPv6 support in Docker, you will need to set the <code>experimental</code> and <code>ip6tables</code> properties in Docker's <code>daemon.json</code> to <code>true</code>.
This file is usually located under <code>/etc/docker/daemon.json</code>. Here's an example of what the contents might look like after tweaking:
<pre>
{
  "experimental": true,
  "ip6tables": true
}
</pre>
After modifying the config file, you will have to restart the Docker daemon.
On Linux distros using systemd this is done via <code>systemctl restart docker</code>. It may also be possible to use the <code>service</code> command: <code>service docker restart</code>.
After restarting, we need to create a Docker network with our IPv6 subnet (in this example, 2001:db8::/64) and a private IPv4 subnet (172.19.0.0/16 in this example; this is optional, and if not specified Docker will pick one from its default ranges). We'll name it <code>ip6net</code> here, but you can pick a name of your choosing:
<pre>
docker network create --ipv6 --subnet 2001:db8::/64 --subnet 172.19.0.0/16 ip6net
</pre>
Once the network is created, we need to use it when running a container by specifying <code>--network ip6net</code> (or the name you picked instead of <code>ip6net</code>) in the run command.
<pre>
docker run -d --network ip6net --name archiveteam --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped [image address] --concurrent 1 [username]
</pre>
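To sanity-check the new network, you can start a throwaway container on it and look for an <code>inet6</code> address from your subnet (a sketch using the public <code>alpine</code> image):
<pre>
docker run --rm --network ip6net alpine ip addr show eth0
</pre>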
=== Strategies for using many IPv6 IPs ===
With the setup from above, Docker will pick IPv6 addresses in ascending order, so the first container will get $SUBNET::1, the second one ::2, and so on.
This is not ideal, as some sites may rate limit based on units larger than a single IP address (such as a /112, /96 or larger).
==== Manual assignment ====
While you can assign a specific IP manually when creating a container via the <code>--ip6</code> argument, this is fairly inconvenient when running large numbers of containers.
==== Simple SNAT ====
A more "lazy" way is to make use of NAT to transparently handle this for us. (And yes, NAT and IPv6 in combination should not be a thing, but since it is we can [ab]use it for our purposes!)
To do this, we create the IPv6 network (as above) with a private range. RFC4193<ref>{{URL|https://www.rfc-editor.org/rfc/rfc4193.txt|RFC 4193}}</ref> specifies <code>fc00::/7</code> for this.
As an example, let's use a random /64 such as <code>fdbf:e8f7:b417:575a::/64</code>.
Also note that we turn off automatic iptables rule creation for masquerade here, so we can configure this ourselves:
<pre>
docker network create --ipv6 -o "com.docker.network.bridge.enable_ip_masquerade=false" --subnet fdbf:e8f7:b417:575a::/64 --subnet 172.19.0.0/16 ip6net
</pre>
Then add an ip6tables rule to do SNAT for us across the whole range (using <code>2001:db8::/64</code> as the public range), and also add the default IPv4 masquerade rule, since we told Docker not to:
<pre>
ip6tables -t nat -A POSTROUTING -s "fdbf:e8f7:b417:575a::/64" -j SNAT --to-source "2001:db8::-2001:db8::ffff:ffff:ffff:ffff"
iptables -t nat -A POSTROUTING -s 172.19.0.0/16 ! -o docker0 -j MASQUERADE
</pre>
There's also a helper script by imer which can do the whole setup in a semi-automatic fashion: https://gist.github.com/imerr/614e534218a6b93be1a40b088dee885a
==== Per-Port SNAT ====
The simple SNAT setup will work great, but will effectively result in each container getting one or two IP addresses due to the way Linux's SNAT does IP selection. This works by hashing the source IP and using that as an index into the IP range<ref>{{URL|http://blog.asiantuntijakaveri.fi/2017/03/linux-snat-with-per-connection-source.html|Linux SNAT with per-connection source address from IP pool}}</ref>. There is no way to change this behaviour aside from patching the kernel, but that is out of scope here.
What we can do, however, is create an SNAT rule for each source port, which will give us a wider distribution of addresses.
The following Python 3 script will do just that:
<pre>
import ipaddress
import subprocess
# CONFIG
# public range
subnet = "2001:db8::/64"
# private range to be snat'ted from
privateV6 = "fdbf:e8f7:b417:575a::/64"
# END CONFIG
def split_ipv6_subnet(subnet, chunks):
        # Convert the subnet to an IPv6 network object
        network = ipaddress.ip_network(subnet, strict=False)
        # Calculate the number of addresses in each chunk
        addresses_per_chunk = network.num_addresses // chunks
        # Ensure that we aren't left with an incomplete final chunk
        if network.num_addresses % chunks:
                addresses_per_chunk += 1
        results = []
        current_address = int(network.network_address)
        end_address = int(network.network_address) + network.num_addresses
        for i in range(chunks):
                # Start of the current chunk
                start = current_address
                # If this is the last chunk, set the end to the end address of the subnet
                if i == chunks - 1:
                        end = end_address
                else:
                        # Otherwise, set the end to the address at the end of the chunk
                        end = start + addresses_per_chunk
                # Convert the start and end addresses back to IPv6 addresses
                start_ip = ipaddress.IPv6Address(start)
                end_ip = ipaddress.IPv6Address(end - 1)  # Subtract 1 to get the last address in the chunk
                results.append(f"{start_ip}-{end_ip}")
                # Update the current address to the end of the chunk
                current_address = end
                # If we've reached the end of the subnet, break out of the loop
                if current_address >= end_address:
                        break
        return results
with open("/proc/sys/net/ipv4/ip_local_port_range", "r") as f:
        content = f.readline()
        PORT_RANGE_START, PORT_RANGE_END = map(int, content.split())
print(PORT_RANGE_START, PORT_RANGE_END)
PORT_RANGE_COUNT = PORT_RANGE_END - PORT_RANGE_START + 1
i = 0
for netRange in split_ipv6_subnet(subnet, PORT_RANGE_COUNT):
        print(str(PORT_RANGE_START + i), "->", netRange)
        subprocess.run(
                ["ip6tables", "-t", "nat", "-A", "POSTROUTING", "-p", "udp", "--sport", str(PORT_RANGE_START + i), "-s",
                privateV6, "-j", "SNAT", "--to-source", netRange])
        subprocess.run(
                ["ip6tables", "-t", "nat", "-A", "POSTROUTING", "-p", "tcp", "--sport", str(PORT_RANGE_START + i), "-s",
                privateV6, "-j", "SNAT", "--to-source", netRange])
        i += 1
</pre>
Note: Adding the 30k+ rules will take a while.
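To confirm the rules were actually installed (and roughly how many), listing the nat table is enough (a sketch):
<pre>
ip6tables -t nat -S POSTROUTING | wc -l
</pre>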
==== Troubleshooting ====
On some providers (OVH and Scaleway are known to need this, for example; Hetzner works fine without) that (presumably) use switched networks, the SNAT setup might not work out of the box (imer's script does take care of this), as the network will not know where to send packets to.
Linux on its own does not reply to neighbour solicitation requests for the whole subnet if the addresses are not added as IPs to the device (which is not feasible for large IPv6 subnets).
Thankfully there's a fix for that<ref>{{URL|https://unix.stackexchange.com/questions/667689/ive-bound-an-entire-ipv6-64-now-how-do-i-get-my-kernel-to-respond-to-arp-t/667738#667738|I've "bound" an entire IPv6 /64 - now how do I get my kernel to respond to ARP to accept packets?}}</ref>: we can install ndppd and configure it to respond to those requests.
Install ndppd and configure it in /etc/ndppd.conf (replacing <code>eth0</code> with your actual network interface name and <code>2001:db8::/64</code> with your actual subnet):
<pre>
proxy eth0 {
    router no
    rule 2001:db8::/64 {
        static
    }
}
</pre>
== Are you a coder? ==



Latest revision as of 17:46, 24 September 2023

Archiveteam1.png This page is currently in draft form and is being worked on. Instructions may be incomplete.
Archive team.png

You can run Archive Team scripts in Docker containers to help with our archiving efforts. It will download sites and upload them to our archive — and it’s really easy to do!

The scripts run in a Docker container, so there is almost no security risk to your computer. "Almost", because in practice nothing is 100% secure. The container will mainly use some of your bandwidth and disk space, as well as some of your CPU and memory. It will get tasks from and report progress to the Tracker.

Basic usage

Docker runs on Windows, macOS, and Linux, and is a free download. Docker runs code in containers, and stores code in images. (Docker requires the professional version of Windows if being run on versions of Windows prior to Windows 10 version 1903.)

Instructions for using Docker CLI on Windows, macOS, or Linux

  1. Download and install Docker from the link above.
  2. Open your terminal. On Windows, you can use either Command Prompt (CMD) or PowerShell, on macOS and Linux you can use Terminal (Bash).
  3. First, we will set up the Watchtower container. Watchtower automatically checks for updates to Docker containers every hour, and if an update is found, it will gracefully shutdown your container, update it, and restart it.
    Use the following command:
    docker run -d --name watchtower --restart=unless-stopped -v /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --label-enable --cleanup --include-restarting --interval 3600
    Explanation:
    • -d: Detaches the container from the terminal and runs it in the background.
    • --name watchtower: The name that is displayed for the container. A name other than "watchtower" can be specified here if needed.
    • --restart=unless-stopped: This tells Docker to restart the container unless you stop it. This also means that it will restart the container automatically when you reboot your system.
    • -v /var/run/docker.sock:/var/run/docker.sock: This provides the Watchtower container access to your system's Docker socket. Watchtower uses this to communicate with Docker on your system to gracefully shutdown and update your containers.
    • containrrr/watchtower: This is the Docker image address for Watchtower.
    • --label-enable: This tells Watchtower only to update containers that are specifically tagged for auto-updating. This is included to prevent Watchtower from updating any other containers you may have running on your system. If you are only using Docker to run Archive Team projects, or wish to automatically update all containers including those that are not for Archive Team projects, you can leave this off.
    • --cleanup: This tells Watchtower to delete old, outdated Docker images, which helps save disk space on your system.
    • --include-restarting: This tells Watchtower to include containers that are in the 'restarting' state. This is included to update a project container if it's caught in a crash-loop, as it wouldn't otherwise be updated.
    • --interval 3600: This tells Watchtower to check for updates to your Docker containers every hour.
  4. Now we will set up a project container. You'll need to know the image address for the script for the project you want to help out with. If you don't know it, you can ask us on IRC.
    Use the following command:
    docker run -d --name archiveteam --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped [image address] --concurrent 1 [username]
    For example, to assist with the Reddit project (#shreddit (on hackint)):
    docker run -d --name archiveteam --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped atdr.meo.ws/archiveteam/reddit-grab --concurrent 1 [username]
    Explanation:
    • -d: Detaches the container from the terminal and runs it in the background.
    • --name archiveteam: The name that is displayed for the container. A name other than "archiveteam" can be specified here if needed (e.g. you want to create multiple containers using the same image).
    • --label=com.centurylinklabs.watchtower.enable=true: Labels the container to be automatically updated by Watchtower. You can leave this off if you did not include --label-enable when launching the Watchtower container.
    • --restart=unless-stopped: This tells Docker to restart the container unless you stop it. This also means that it will restart the container automatically when you reboot your system.
    • [image address]: Replace this with the image address for the project you would like to help with. The brackets should not be included in the final command. Additionally, the address should not include https:// or http://, and all characters must be lowercase. Most project images will be made available at 'atdr.meo.ws/archiveteam/$repo-grab' where $repo is the same name as used on code repository. E.g. The code at https://github.com/ArchiveTeam/reddit-grab corresponds to the Docker image address of 'atdr.meo.ws/archiveteam/reddit-grab'.
    • --concurrent 1: Process 1 item at a time per container. Although this varies for each project, the maximum recommended value is 5, and the maximum allowed value is 20. Leave this at 1, or check with us on IRC if you are unsure.
    • [username]: Choose a username - we'll show your progress on the project leaderboard (tracker). The brackets should not be included in the final command.
Archiveteam1.png On Windows and macOS, once you have completed steps 1-4, you can also start, stop, and delete containers in the Docker Desktop UI. However, for the time being, initial setup and switching projects can only be done from the command line. Docker on Linux (either in a VM or on bare metal hardware) is the recommended way to run Docker containers.

If you prefer Podman over Docker, User:Sanqui has had success running the Warrior in Docker using podman run --detach --name at-warrior --label=io.containers.autoupdate --restart=on-failure --publish 8001:8001 atdr.meo.ws/archiveteam/warrior-dockerfile and podman-auto-update in place of Watchtower.

Stopping containers

  1. Recommended method: Attempt graceful stop by sending the SIGINT signal, with no hard-kill deadline:
    docker kill --signal=SIGINT archiveteam
    Explanation:
    • kill: Docker's command for killing a container, defaults to sending a SIGKILL signal unless otherwise specified
    • --signal=SIGINT: tells Docker to send a SIGINT signal to the container (not a SIGKILL)
    • archiveteam: This is the name of the Docker container(s) that need to be stopped. If needed, replace with the actual container name(s) you want to stop. Multiple containers can be stopped with the same command.
  2. Alternate, unrecommended method: Attempt stop, with a hard-kill deadline of 1 hour:
    docker stop -t 3600 archiveteam
    Explanation:
    • -t 3600: tells Docker to wait for 3600 seconds (60 minutes) before forcibly stopping the container. Docker's default is -t 10 (not recommended). Use -t 0to stop immediately (also not recommended). Hard-kill deadlines are problematic because large multi-GB projects may require long-running jobs (e.g. 48 hours for content download + additional hours of rsync upload time that itself may be delayed by upload bandwidth limits and/or congestion on the rsync target). Please ask in the project IRC channel if you are considering using a hard-kill method, especially for projects where there may not be time for another worker to retry later. (There may be interest in recovering/saving partial WARCs from containers that did not end gracefully.) Also see the FAQ entry about ungraceful stops.
    • archiveteam: This is the name of the Docker container(s) that need to be stopped. If needed, replace with the actual container name(s) you want to stop. Multiple containers can be stopped with the same command.

The same commands can also be used to stop the watchtower container.

Starting containers

Similarly, to start your containers again in the future, run docker start watchtower archiveteam. If needed, replace "watchtower" and "archiveteam" with the actual container names you used.

Deleting containers

To delete a container, run docker rm archiveteam. If needed, replace "archiveteam" with the name of the actual container you want to delete. To free up disk space, you can also purge your unused Docker images by running docker image prune. Note that this command will delete all Docker images on your system that are not associated with a container, not just Archive Team ones.

Checking for project updates

Remember to periodically check our IRC channels and homepage so you switch your scripts to a current project. Projects change frequently at Archive Team, and at the moment we don't have a way to automatically switch the projects run in Docker containers. To switch projects, simply stop your existing Archive Team container by running docker stop archiveteam, and delete it by running docker rm archiveteam and run a new one by repeating step 4. Then, you can optionally prune your unused Docker images as in step 7. Note: you don't need to stop or replace your Watchtower container, just make sure it is still running by using docker ps -f name=watchtower. If Watchtower is not running or you are unsure, run docker start watchtower.

FAQ

Why a Docker container in the first place?

A Docker container is a quick, safe, and easy way for newcomers to help us out. It offers many features:

  • Self-updating software infrastructure provided by Watchtower
  • Allows for unattended use
  • In case of software faults, your machine is not ruined
  • Restarts itself in case of runaway programs
  • Runs on Windows, macOS, and Linux painlessly
  • Ensures consistency in the archived data regardless of your machine's quirks
  • Restarts automatically after a system restart

If you have suggestions for improving this system, please talk to us as described below.

Can I use whatever internet access for running scripts?

No. We need "clean" connections. Please ensure the following:

  • Use a DNS server that issues correct responses. Pinging a nonexistent domain should never return any IP, it should return NXDOMAIN. As an example, before 2014 OpenDNS redirected requests for nonexistent domains to a search page with ads. This is not clean. Another example of an "unclean" DNS is CleanBrowsing which aims to shield its users from fap material. The DNS should preferably not attempt to filter anything, not even phishing domains. 9.9.9.10 from Quad9 may be a good public DNS. 8.8.8.8 from Google should be unfiltered as well.
  • No ISP connections that inject advertisements into web pages or otherwise scan/filter/change content. The practice is less common nowadays as most sites use SSL which complicates injection. Doesn't stop some parties from trying anyway.[1]
  • No proxies. Proxies can return bad data. The original HTTP headers and IP address are needed for the WARC file.
  • No content-filtering firewalls.
  • No major censorship. If you believe your country implements major censorship, do not run a warrior. Examples are China and Turkey. What content may or may not be accessible is unpredictable in these countries, and requests may return a page that says "this website is blocked" which is unhelpful to archive. "Minor" censorship is far more common: where a small number of sites are blocked, the blocks are widely announced and blocks are not frequently implemented. For example, several countries have blocked The Pirate Bay and a ruling from the European Commission requires European providers to block access to RT and Sputnik. Another example of "minor" censorship is when access is blocked to sites you wouldn't want to archive in a million years, like those dedicated to hosting imagery of child abuse. While censorship is always a bad idea (and abusive sites should be shut down, not blocked), "minor" censorship typically won't (..or shouldn't) affect Warrior as the blocks are predictable. Obviously you won't be able to contribute to archiving sites that are blocked for you. When in any doubt, ask on IRC first.
  • No Tor. The server may return an error page instead of content if they ban exit nodes.
  • No free cafe/public transport/store wifi. Archiving your cafe's wifi service agreement repeatedly is not helpful. In addition, you may slow down the service for the people around you.
  • No VPNs. Data integrity is a very high priority for the Archive Team so use of VPNs with the official crawler is discouraged. Servers may also be more likely to deploy a rate limit or serve a CAPTCHA page when using a VPN which is unhelpful to archive.
  • We prefer connections from many public unshared IP addresses if possible. If a single IP attempts to back up an entire site, it may result in that IP getting banned by the server. Also, if a server does ban an IP, we'd rather this ban only affects you and not everyone in your apartment building.

I turned my Docker container off. Will those tasks be lost?

If you've killed your Docker instance, then the work your container did has been lost. However, the tasks will be returned to the pool after a period of time, and others may claim them.

How much disk space will the Docker container use?

Short answer: it depends on the project. Ask in the project IRC channel.

Long answer: because each project defines items differently, sizes may vary. A single task may be a small file or a whole subsection of a website.

How can I see the status of my archiving?

You can check the project leaderboard to see how much you've archived. If you want to see the current status of your Docker container, you can run docker logs --tail 0 -f archiveteam. --tail 0 tells Docker to only show newly added log messages, and -f tells Docker to keep displaying logs as they come in until you press Control-C to stop it. If needed, replace "archiveteam" with the actual name you used for your container.

How can I look around inside a container?

Run this to bring up a command shell inside the container. Replace 'archiveteam' with the name of the container:
sudo docker exec -t -i archiveteam /bin/bash

Can I run the Warrior on ARM or some other unusual architecture?

No, currently we do not allow ARM (used on Raspberry Pi and M1 Macs) or other non-x86 architectures. This is because we have previously discovered questionable practices in the Wget archive-creating components and are not confident that they behave correctly under different endiannesses and the like. If you still want to run it, Docker can apparently emulate x86_64; a hedged sketch follows.
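
As a sketch (unsupported, and expect emulation to be slow): on a Linux ARM host you can register QEMU's x86_64 handler and then ask Docker to run the image under emulation. The binfmt image below is the commonly used one; the placeholders follow the run commands used elsewhere on this page.

docker run --privileged --rm tonistiigi/binfmt --install amd64   # one-time: register the x86_64 emulator
docker run -d --platform linux/amd64 --name archiveteam --restart=unless-stopped [image address] --concurrent 1 [username]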

How can I run tons of containers easily?

We assume you've checked with the current Archive Team project about what concurrency and resources are needed or useful!

Whether you have your own virtual cluster or you're renting someone else's (aka a "cloud"), you probably need some orchestration software.

ArchiveTeam volunteers have successfully used a variety of hosting providers and tools (including free trials on AWS and GCE), often just by building their own flavour of virtual server and then repeating it with simple cloud-init scripts (to install and launch Docker as above; a sketch follows) or whatever tooling the host provides. If you want full automation, the archiveteam-infra repository by diggan helps with Terraform on DigitalOcean.
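
As a minimal, hypothetical cloud-init sketch (assuming a Debian/Ubuntu image and the get.docker.com convenience installer; [image address] and [username] are the same placeholders as in the run commands above):

#cloud-config
# install Docker via the convenience script, then start one container
runcmd:
  - curl -fsSL https://get.docker.com | sh
  - docker run -d --name archiveteam --restart=unless-stopped [image address] --concurrent 1 [username]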

Some custom monitoring scripts also exist, for instance watcher.py.

I'd like to help write code or I want to tweak the scripts to run to my liking. Where can I find more info? Where is the source code and repository?

Check out the Dev documentation for details on the infrastructure and details of the source code layout.

I still have a question!

Check out the general FAQ page. Talk to us on IRC. Use #archiveteam-bs for general questions or the project IRC channel for project-specific instructions.

Troubleshooting

(Linux) Running Docker commands gives me a permission denied error. How can I fix this?

There are a few ways to fix this issue. The quickest is to put sudo before your Docker commands, which runs them as the root user. You can also log into your system as root and run the Docker commands from there. Alternatively, you can create a docker group and add your account to it (see the commands below): run sudo groupadd docker, then sudo usermod -aG docker $USER, and activate the changes by running newgrp docker or simply logging out and back in (or rebooting)[2].
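
The group-based fix, as a copy-pasteable sequence:

sudo groupadd docker            # create the docker group (it may already exist)
sudo usermod -aG docker $USER   # add your account to the group
newgrp docker                   # apply the change in the current shell (or log out and back in)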

I see a message that no item was received.

This means that there is no work available. This can happen for several reasons:

  • The project has just finished and someone is inspecting the work done. If a problem is discovered, items may be re-queued and more work will become available.
  • You have checked out/claimed too many items. Reduce your concurrency and let others do some of the work too.
  • In rare cases, you have been banned by a tracker administrator because there was a problem with your work: you were requesting too many items, you were tampering with the scripts, a malfunction occurred, or your internet connection is "unclean" (see above).

I see a message about rate limiting.

Don't worry. Keep in mind that although downloading the internet for fun and digital preservation is the primary goal of all Archive Team activities, it can put serious stress on the target's servers. The rate limit is imposed by a tracker administrator and should not be subverted.

(In other words, we don't want to DDoS the servers.)

If you like, you can switch to another project with less load.

I see a message about code being out of date.

Don't worry: a new update is ready. You do not need to do anything about this if you are running the container with Watchtower; Watchtower will update the code every hour. If you are impatient, stop and remove your container (see below), then repeat step 4 of the setup instructions and it will download the latest code and resume work.
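
For the impatient path, stopping and removing the container looks like this (replace "archiveteam" with your container name, then repeat step 4 of the setup instructions):

docker stop archiveteam
docker rm archiveteam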

I'm running the scripts manually and I see a message about code being out of date.

This happens when a bug in the scripts is discovered. Bugs are unavoidable, especially since the target server is out of our control.

I see messages about rsync errors.

If those messages are saying max connections reached -- try again later, then everything is fine and the file will be uploaded eventually.

If the above error persists for hours (for the same item), or if the error message says something else, then something is not right. Please notify us immediately in the appropriate IRC channel.

The item I'm working on is downloading thousands of URLs and it's taking hours.

Please notify us in the appropriate IRC channel. You may need to restart the container.

The instructions to run the software/scripts are awful and they are difficult to set up.

Well, excuuuuse me, princess!

We're not a professional support team so help us help you help us all. See above for bug reports, suggestions, or code contributions.

Recovering from an ungraceful container stop

Please ask in the project IRC channel if some of your containers were stopped ungracefully. This includes stops that used a hard kill, as well as stops caused by system failures or power outages. It is especially important for projects where there may not be enough time for another worker to retry later. Do not attempt to start or restart the affected containers. (Note: it is possible to recover/save partial WARCs from still-running containers that are about to be terminated by using docker cp archiveteam:/grab/ ./ or similar; see below.)
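
For a still-running container that is about to be terminated, the rescue from the note above might look like this (the /grab path comes from that note; adjust the container name as needed):

docker exec archiveteam ls /grab   # peek at the container's work directory first
docker cp archiveteam:/grab/ ./    # copy any partial WARCs into the current directory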

Where can I file a bug, suggestion, or a feature request?

If the issue is related to the web interface or the library that grab scripts are using, see seesaw-kit issues. Other issues should be filed into their own repositories.


Advanced usage

Resource constraints / CPU priority with cgroups

While Docker does have per-container resource limits[3] (see the sketch below), using a cgroup allows you to give a group of containers shared resource constraints. This is better suited to how most people running many Archive Team projects at once want to control resource usage.
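
For comparison, Docker's own per-container limits might look like this (a sketch; the --cpus and --memory values are arbitrary examples, and the placeholders follow the run commands used elsewhere on this page):

docker run -d --cpus 2 --memory 4g --name archiveteam --restart=unless-stopped [image address] --concurrent 1 [username]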

Defining a cgroup on systems that use systemd is fairly straightforward: it's done by creating a .slice file under /etc/systemd/system.

To give an example, let's define a cgroup that only allows processes to use CPU time that would otherwise be idle, which should mean there is no impact on other processes. Here we'll use archiveteam.slice and create the file at /etc/systemd/system/archiveteam.slice:

[Slice]
# With the special "idle" weight processes only get cpu if there is otherwise idle capacity
# The CPUWeight can also be set to a number from 1 to 10000 for relative weighting. The default is 100. A higher weight means more CPU time, a lower weight means less.
CPUWeight=idle
# optional: maximum memory usage of this slice, prevents system oom situations if a container balloons due to changes
# When the memory limit is reached, the OOM killer will simply kill processes, so make sure this is just a last line of defense/safety limit to prevent your system from locking up
#MemoryMax=20G

For more options, run man systemd.resource-control or check the Debian online systemd.resource-control man page.

In order to use this cgroup for a container, it needs to be specified in the docker run command via the --cgroup-parent argument, for example:

docker run -d --cgroup-parent archiveteam.slice --name archiveteam --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped [image address] --concurrent 1 [username]
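
One detail worth noting (an assumption based on standard systemd behaviour, not something specific to these scripts): systemd only picks up new unit files after a reload, and you can check that containers actually landed in the slice:

sudo systemctl daemon-reload             # re-read unit files so the new slice is known
sudo systemctl status archiveteam.slice  # once containers are running, they should be listed under the slice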

Enabling IPv6

Some projects now support (and prefer) IPv6 when available. In Docker, IPv6 support is disabled by default.[4]

To enable IPv6 support in Docker, you will need to set the experimental and ip6tables properties in Docker's daemon.json to true.

This file is usually located under /etc/docker/daemon.json. Here's an example of what the contents might look like after tweaking:

{
  "experimental": true,
  "ip6tables": true
}

After modifying the config file, you will have to restart the Docker daemon. On Linux distros using systemd, this is done via systemctl restart docker. It may also be possible to use the service command: service docker restart.

After restarting, we need to create a Docker network with our IPv6 subnet (2001:db8::/64 in this example) and a private IPv4 subnet (172.19.0.0/16 in this example; this part is optional, as Docker will pick one from its default ranges if not specified). We'll name it ip6net here, but you can pick a name of your choosing:

docker network create --ipv6 --subnet 2001:db8::/64 --subnet 172.19.0.0/16 ip6net

Once the network is created, we need to use it when running a container by specifying --network ip6net (or the name you picked instead of ip6net) in the run command.

docker run -d --network ip6net --name archiveteam --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped [image address] --concurrent 1 [username]

Strategies for using many IPv6 IPs

With the setup from above, Docker will assign IPv6 addresses in ascending order, so the first container will get $SUBNET::1, the second ::2, and so on.

This is not ideal as some sites may rate limit based on units larger than a single IP address (such as a /112, /96 or larger).

Manual assignment

While you can assign a specific IP manually when creating a container via the --ip argument (--ip6 for an IPv6 address), this is fairly inconvenient when running large numbers of containers; a hedged sketch follows.
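
A sketch of manual assignment on the ip6net network from above (2001:db8::1234 is an arbitrary address inside the example subnet; the Watchtower label from the earlier run commands is omitted for brevity):

docker run -d --network ip6net --ip6 2001:db8::1234 --name archiveteam --restart=unless-stopped [image address] --concurrent 1 [username]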

Simple SNAT

A more "lazy" way is to make use of NAT to transparently handle this for us. (And yes, NAT and IPv6 in combination should not be a thing, but since it is we can [ab]use it for our purposes!)

To do this, we create the IPv6 network (as above) with a private range. RFC4193[5] reserves fc00::/7 for this. As an example, let's use a random /64 such as fdbf:e8f7:b417:575a::/64.

Note also that we turn off automatic iptables masquerade rule creation here, so we can configure it ourselves:

docker network create --ipv6 -o "com.docker.network.bridge.enable_ip_masquerade=false" --subnet fdbf:e8f7:b417:575a::/64 --subnet 172.19.0.0/16 ip6net

Then add an ip6tables rule to SNAT the whole range (using 2001:db8::/64 as the public range), plus the default IPv4 masquerade rule, since we told Docker not to add it:

ip6tables -t nat -A POSTROUTING -s "fdbf:e8f7:b417:575a::/64" -j SNAT --to-source "2001:db8::-2001:db8::ffff:ffff:ffff:ffff"
iptables -t nat -A POSTROUTING -s 172.19.0.0/16 ! -o docker0 -j MASQUERADE
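
As a sanity check (purely a verification step, not part of the setup), you can list the NAT rules that are now in place:

ip6tables -t nat -S POSTROUTING   # the SNAT rule covering the private range should appear here
iptables -t nat -S POSTROUTING    # as should the IPv4 MASQUERADE rule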

There's also a helper script by imer which can do the whole setup in a semi-automatic fashion: https://gist.github.com/imerr/614e534218a6b93be1a40b088dee885a

Per-Port SNAT

The simple SNAT setup will work, but will effectively result in each container getting only one or two IP addresses, due to the way Linux's SNAT selects IPs: it hashes the source IP and uses that as an index into the IP range[6]. There is no way to change this behaviour aside from patching the kernel, which is out of scope here.

What we can do, however, is create an SNAT rule for each source port, which will give us a wider distribution of addresses.

The following python3 script will do just that:

import ipaddress
import subprocess
# CONFIG
# public range
subnet = "2001:db8::/64"
# private range to be snat'ted from 
privateV6 = "fdbf:e8f7:b417:575a::/64"
# END CONFIG

def split_ipv6_subnet(subnet, chunks):
        # Convert the subnet to an IPv6 network object
        network = ipaddress.ip_network(subnet, strict=False)

        # Calculate the number of addresses in each chunk
        addresses_per_chunk = network.num_addresses // chunks

        # Ensure that we aren't left with an incomplete final chunk
        if network.num_addresses % chunks:
                addresses_per_chunk += 1

        results = []
        current_address = int(network.network_address)
        end_address = int(network.network_address) + network.num_addresses

        for i in range(chunks):
                # Start of the current chunk
                start = current_address

                # If this is the last chunk, set the end to the end address of the subnet
                if i == chunks - 1:
                        end = end_address
                else:
                        # Otherwise, set the end to the address at the end of the chunk
                        end = start + addresses_per_chunk

                # Convert the start and end addresses back to IPv6 addresses
                start_ip = ipaddress.IPv6Address(start)
                end_ip = ipaddress.IPv6Address(end - 1)  # Subtract 1 to get the last address in the chunk

                results.append(f"{start_ip}-{end_ip}")

                # Update the current address to the end of the chunk
                current_address = end

                # If we've reached the end of the subnet, break out of the loop
                if current_address >= end_address:
                        break

        return results


with open("/proc/sys/net/ipv4/ip_local_port_range", "r") as f:
        content = f.readline()
        PORT_RANGE_START, PORT_RANGE_END = map(int, content.split())

print(PORT_RANGE_START, PORT_RANGE_END)
PORT_RANGE_COUNT = PORT_RANGE_END - PORT_RANGE_START + 1
i = 0
for netRange in split_ipv6_subnet(subnet, PORT_RANGE_COUNT):
        print(str(PORT_RANGE_START + i), "->", netRange)
        subprocess.run(
                ["ip6tables", "-t", "nat", "-A", "POSTROUTING", "-p", "udp", "--sport", str(PORT_RANGE_START + i), "-s",
                 privateV6, "-j", "SNAT", "--to-source", netRange])
        subprocess.run(
                ["ip6tables", "-t", "nat", "-A", "POSTROUTING", "-p", "tcp", "--sport", str(PORT_RANGE_START + i), "-s",
                 privateV6, "-j", "SNAT", "--to-source", netRange])
        i += 1

Note: Adding the 30k+ rules will take a while.
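
Save the script under a name of your choosing (per_port_snat.py below is just a hypothetical filename) and run it as root so it can modify the NAT table:

sudo python3 per_port_snat.py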

Troubleshooting

On some providers that (presumably) use switched networks, the SNAT setup might not work out of the box, as the network will not know where to send packets (OVH and Scaleway are known to need this, for example; Hetzner works fine without). imer's script does take care of this.

Linux on its own does not reply to neighbour solicitation requests for the whole subnet unless the addresses are added as IPs on the device (which is not feasible for large IPv6 subnets).

Thankfully, there's a fix for that[7]: we can install ndppd and configure it to respond to those requests.

Install ndppd and configure it in /etc/ndppd.conf (replacing eth0 with your actual network interface name and 2001:db8::/64 with your actual subnet):

proxy eth0 {
    router no
    rule 2001:db8::/64 {
        static
    }
}
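
After editing the config, restart the daemon so it picks up the changes (this assumes ndppd was installed from your distro's packages and runs as a systemd service named ndppd):

sudo systemctl restart ndppd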

Are you a coder?

Like our scripts? Interested in how they work under the hood? Got software skills? Help us improve them!