Difference between revisions of "Running Archive Team Projects with Docker"

Revision as of 00:03, 18 December 2020

This page is currently in draft form and is being worked on. Instructions may be incomplete.

You can run Archive Team scripts in Docker containers to help with our archiving efforts. The scripts download sites and upload them to our archive, and it’s really easy to do!

The scripts run in a Docker container, so there is no risk to your computer. The container will only use your bandwidth and some of your disk space. It will get tasks from and report progress to the Tracker.

Basic usage

Docker runs on Windows, macOS, and Linux, and is a free download. Docker runs code in containers, and stores code in images.

Instructions for using Docker CLI on Windows, macOS, or Linux

  1. Download and install Docker from the link above.
  2. Open your terminal. On Windows, you can use either Command Prompt (CMD) or PowerShell; on macOS and Linux, you can use Terminal (Bash).
  3. First, we will set up the Watchtower container. Watchtower automatically checks for updates to Docker containers every five minutes, and if an update is found, it will gracefully shut down your container, update it, and restart it.
    Use the following command:
    docker run -d --name watchtower --restart=unless-stopped -v /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --label-enable
    Explanation:
    • -d: Detaches the container from the terminal and runs it in the background.
    • --name watchtower: The name that is displayed for the container. A name other than "watchtower" can be specified here if needed.
    • --restart=unless-stopped: This tells Docker to restart the container unless you stop it.
    • -v /var/run/docker.sock:/var/run/docker.sock: This provides the Watchtower container access to your system's Docker socket. Watchtower uses this to communicate with Docker on your system to gracefully shut down and update your containers.
    • --label-enable: This tells Watchtower only to update containers that are specifically tagged for auto-updating. This is included to prevent Watchtower from updating any other containers you may have running on your system. If you are only using Docker to run Archive Team projects, or wish to automatically update all containers including those that are not for Archive Team projects, you can leave this off.
  4. Now we will set up a project container. You'll need to know the image address for the script for the project you want to help out with. If you don't know it, you can ask us on IRC.
    Use the following command:
    docker run -d --name archiveteam --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped [image address] --concurrent 1 [username]
    Explanation:
    • -d: Detaches the container from the terminal and runs it in the background.
    • --name archiveteam: The name that is displayed for the container. A name other than "archiveteam" can be specified here if needed.
    • --label=com.centurylinklabs.watchtower.enable=true: Labels the container to be automatically updated by Watchtower. You can leave this off if you did not include --label-enable when launching the Watchtower container.
    • --restart=unless-stopped: This tells Docker to restart the container unless you stop it.
    • [image address]: Replace this with the image address for the project you would like to help with. The brackets should not be included in the final command.
    • --concurrent 1: Process 1 item at a time. Although this varies for each project, the maximum recommended value is 5, and the maximum allowed value is 20. Leave this at 1, or check with us on IRC if you are unsure.
    • [username]: Choose a username - we'll show your progress on the leaderboard. The brackets should not be included in the final command.
  5. If you wish to stop running your containers, run docker stop watchtower archiveteam. If needed, replace "watchtower" and "archiveteam" with the actual container names you used.
  6. Similarly, to start your containers again in the future, run docker start watchtower archiveteam. If needed, replace "watchtower" and "archiveteam" with the actual container names you used.
  7. To delete a container, run docker rm archiveteam. If needed, replace "archiveteam" with the name of the actual container you want to delete. To free up disk space, you can also purge unused Docker images by running docker image prune, which removes dangling (untagged) images; docker image prune -a also removes every image that is not associated with a container. Note that these commands apply to all Docker images on your system, not just Archive Team ones.
  8. Remember to periodically check our IRC channels and homepage so you can switch your scripts to a current project. Projects change frequently at Archive Team, and at the moment we don't have a way to automatically switch the projects run in Docker containers. To switch projects, stop your existing Archive Team container by running docker stop archiveteam, delete it by running docker rm archiveteam, and run a new one by repeating step 4. Then, you can optionally prune your unused Docker images as in step 7. Note: you don't need to stop or replace your Watchtower container; just make sure it is still running by using docker ps -f name=watchtower. If Watchtower is not running or you are unsure, run docker start watchtower.
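
The stop, delete, and re-run sequence from step 8 can be sketched as a small shell function. This is only an illustration, not an official Archive Team tool: the function name switch_project and the DOCKER variable are made up for this example, and the image address and username are whatever you would have used in step 4. By default it only prints the commands (a dry run); set DOCKER=docker to really run them.

```shell
#!/usr/bin/env bash
# Sketch: replace the project container with one running a different image.
# DOCKER is a hypothetical helper variable; it defaults to "echo docker",
# so the commands are printed instead of executed (dry run).
DOCKER="${DOCKER:-echo docker}"

switch_project() {
  local image="$1" username="$2" name="${3:-archiveteam}"
  # Stop and delete the old project container (step 8).
  $DOCKER stop "$name"
  $DOCKER rm "$name"
  # Start the new project container with the same flags as step 4.
  $DOCKER run -d --name "$name" \
    --label=com.centurylinklabs.watchtower.enable=true \
    --restart=unless-stopped \
    "$image" --concurrent 1 "$username"
}

# Only act when an image address and a username are given on the command line.
if [ "$#" -ge 2 ]; then
  switch_project "$@"
fi
```

Saved as, say, switch.sh, it could be run as DOCKER=docker sh switch.sh [image address] [username]. The Watchtower container is left alone, as step 8 recommends.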


FAQ

Why a Docker container in the first place?

A Docker container is a quick, safe, and easy way for newcomers to help us out. It offers many features:

  • Self-updating software infrastructure provided by Watchtower
  • Allows for unattended use
  • In case of software faults, your machine is not ruined
  • Restarts itself in case of runaway programs
  • Runs on Windows, macOS, and Linux painlessly
  • Ensures consistency in the archived data regardless of your machine's quirks

If you have suggestions for improving this system, please talk to us as described below.

Can I use whatever internet access for running scripts?

No. We need "clean" connections. Please ensure the following:

  • No OpenDNS. No ISP DNS that redirects to a search page. Use non-captive DNS servers.
  • No ISP connections that inject advertisements into web pages.
  • No proxies. Proxies can return bad data. The original HTTP headers and IP address are needed for the WARC file.
  • No content-filtering firewalls.
  • No censorship. If you believe your country implements censorship, do not run Archive Team scripts.
  • No Tor. The server may return an error page instead of content if they ban exit nodes.
  • No free cafe wifi. Archiving your cafe's wifi service agreement repeatedly is not helpful.
  • No VPNs. Data integrity is a very high priority for the Archive Team so use of VPNs with the official crawler is discouraged.
  • We prefer connections from many public IP addresses if possible. (For example, if your apartment building uses a single IP address, we don't want your apartment banned.)

I turned my Docker container off. Will those tasks be lost?

If you've killed your Docker instance, then the work your container did has been lost. However, the tasks will be returned to the pool after a period of time, and others may claim them. If you want, you can alert the admins via IRC of what's happened and they can clear the claims your username may have made, but this isn't very important on most projects.

How much disk space will the Docker container use?

Short answer: it depends on the project.

Long answer: because each project defines items differently, sizes may vary. A single task may be a small file or a whole subsection of a website.

How can I see the status of my archiving?

You can check the leaderboard to see how much you've archived. If you want to see the current status of your Docker container, you can run docker logs -n 0 -f archiveteam. -n 0 tells Docker to skip the existing log history and show only new output, and -f tells Docker to keep displaying logs as they come in until you press Control-C to stop it. If needed, replace "archiveteam" with the actual name you used for your container.
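
Besides following the logs, it can be handy to confirm that both containers are actually up. The sketch below is one possible way to do that; the function names container_running and status_report are invented for this example, and it assumes the default container names used on this page.

```shell
#!/usr/bin/env bash
# Sketch: report whether each named container is currently running.

container_running() {
  local name="$1"
  # "docker ps" lists running containers only; --format prints just the names.
  # grep -Fx matches the whole line literally, so "archiveteam" does not
  # also match a container called "archiveteam2".
  docker ps --format '{{.Names}}' | grep -Fxq -- "$name"
}

status_report() {
  local name
  for name in "$@"; do
    if container_running "$name"; then
      echo "$name: running"
    else
      echo "$name: not running"
    fi
  done
}

# Only act when container names are given on the command line.
if [ "$#" -gt 0 ]; then
  status_report "$@"
fi
```

For example, saved as status.sh, it could be run as sh status.sh watchtower archiveteam. A container reported as not running can be started again with docker start.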


How can I run tons of containers easily?

We assume you've checked with the current ArchiveTeam project what concurrency and resources are needed or useful!

Whether you have your own virtual cluster or you're renting someone else's (aka a "cloud"), you probably need some orchestration software.

ArchiveTeam volunteers have successfully used a variety of hosting providers and tools (including free trials on AWS and GCE). Often they just build their own flavour of virtual server and then repeat it with simple cloud-init scripts (to install and launch Docker as above) or whatever tool the hosting provider offers. If you desire full automation, the archiveteam-infra repository by diggan helps with Terraform on DigitalOcean.

Some custom monitoring scripts also exist, for instance watcher.py.
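
On a single machine, a "simple script" of the kind mentioned above might just start several project containers in a loop. The following is a rough sketch under stated assumptions, not one of the volunteers' actual scripts: the function name launch_many, the DOCKER variable, and the archiveteam-1, archiveteam-2, ... naming scheme are all invented here. It assumes Docker is already installed and prints the commands by default; set DOCKER=docker to run them.

```shell
#!/usr/bin/env bash
# Sketch: start N project containers named archiveteam-1 .. archiveteam-N.
# DOCKER is a hypothetical helper variable; the default is a dry run that
# only prints the commands.
DOCKER="${DOCKER:-echo docker}"

launch_many() {
  local count="$1" image="$2" username="$3"
  local i=1
  while [ "$i" -le "$count" ]; do
    # Same flags as the single-container command in the basic instructions;
    # only the container name changes.
    $DOCKER run -d --name "archiveteam-$i" \
      --label=com.centurylinklabs.watchtower.enable=true \
      --restart=unless-stopped \
      "$image" --concurrent 1 "$username"
    i=$((i + 1))
  done
}

# Only act when a count, an image address, and a username are given.
if [ "$#" -eq 3 ]; then
  launch_many "$@"
fi
```

Check with the project channel before raising the count; the limits on concurrency described earlier still apply to the total load you generate.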

I'd like to help write code or I want to tweak the scripts to run to my liking. Where can I find more info? Where is the source code and repository?

Check out the Dev documentation for details on the infrastructure and details of the source code layout.

I still have a question!

Check out the general FAQ page. Talk to us on IRC. Use #archiveteam-bs for general questions or the project IRC channel for project-specific instructions.

Troubleshooting

I see a message that no item was received.

This means that there is no work available. This can happen for several reasons:

  • The project has just finished and someone is inspecting the work done. If a problem is discovered, items may be re-queued and more work will become available.
  • You have checked out/claimed too many items. Reduce your concurrency and let others do some of the work too.
  • In a rare case, you have been banned by a tracker administrator because there was a problem with your work: you were requesting too much, you were tampering with the scripts, a malfunction has occurred, or your internet connection is "unclean" (see above).

I see a message about rate limiting.

Don't worry. Keep in mind that although downloading the internet for fun and digital preservation are the primary goals of all Archive Team activities, serious stress on the target's server may occur. The rate limit is imposed by a tracker administrator and should not be subverted.

(In other words, we don't want to DDoS the servers.)

If you like, you can switch to another project with less load.

I see a message about code being out of date.

Don't worry. There is a new update ready. You do not need to do anything about this if you are running the container with Watchtower; Watchtower checks for updates every five minutes and will update the container's code automatically. If you are impatient, please (insert manual update instructions) and it will download the latest code and resume work.

I'm running the scripts manually and I see a message about code being out of date.

This happens when a bug in the scripts is discovered. Bugs are unavoidable, especially when the server is out of our control.

I see messages about rsync errors.

Uh-oh! Something is not right. Please notify us immediately in the appropriate IRC channel.

The item I'm working on is downloading thousands of URLs and it's taking hours.

Please notify us in the appropriate IRC channel. You may need to restart the container.

The instructions to run the software/scripts are awful and they are difficult to set up.

Well, excuuuuse me, princess!

We're not a professional support team so help us help you help us all. See above for bug reports, suggestions, or code contributions.

Where can I file a bug, suggestion, or a feature request?

If the issue is related to the web interface or the library that the grab scripts use, see seesaw-kit issues. Other issues should be filed in their own repositories.

Are you a coder?

Like our scripts? Interested in how they work under the hood? Got software skills? Help us improve them!