User:Yan/Dev
Welcome, digital archivists, to Archive Team! Programmers, developers, engineers… You're here because the cloud is going to delete your data—and you're going to stop that. So, who is ready to rescue some history?
Infrastructure overview
The Archive Team infrastructure is a distributed web crawling and processing system used to mount "distributed preservation of service" attacks.
Component Overview
Figure | Description |
---|---|
1 | Website in Danger |
2 | Warrior |
3 | Tracker |
4 | Staging Server |
5 | Internet Archive |
Website in Danger
The website in danger is typically a website exhibiting some combination of:
- acquihire
- mass layoffs
- neglect, decay, poor health, or owners missing in action
- political and legal issues
- robots.txt exclusion file that forbids crawling by Wayback Machine (whether intentionally or unintentionally)
- cultural significance
Warrior
The Warrior is client code run by volunteers that grabs/scrapes the content of the website in danger.
Websites often implement throttling systems to protect themselves against abuse such as spam or excessive server load. Typical systems use IP address bans. As such, many Warriors, running on many IP addresses, are needed.
Content is usually grabbed and saved in WARC files.
Tracker
The Tracker is server code run by "core" Archive Team volunteers. The Tracker assigns what the Warrior should download and provides a leaderboard.
Staging Server
Staging servers are typically Rsync hosts, often run by "core" volunteers. Warriors upload WARC files to these hosts. The hosts queue and package the WARC files into large WARC files (Megawarcs). Then, the Megawarcs are uploaded to the Internet Archive under the Archive Team collection.
Internet Archive
The Internet Archive is a digital library and archive. It is different from other hosting services because it is not a distribution platform. If there is a legal issue, items are "darked" (made inaccessible to the general public) instead of deleted.
An item is ingested by the Wayback Machine if it
- has warc.gz files,
- has a "web" media type,
- and is under the Archive Team collection.
Since around 2015-2016 (I'm not 100% sure of the timeline), to prevent abuse, the Wayback Machine only ingests WARCs from whitelisted users. Official Archive Team crawls/tools almost always go into the Wayback Machine.
Source code repositories
Fork me on GitHub! File and triage issues, fix bugs, refactor code, submit pull requests… all welcome! Discussion in #archiveteam-dev (on hackint).
The warrior uses the following repos:
Client code
Client code includes code that the Warrior executes.
- warrior3 - bootstrap and tools to build the image
- Bootstrap code that is pulled from GitHub by the appliance and starts a docker container
- archiveteam/warrior-dockerfile - the container
- Instructions to bootstrap the docker container
- warrior2 - warrior runner code
- Main code that runs inside of the docker container
- seesaw-kit
- Library that helps build grab scripts, the web interface, and pipeline engine for the warrior. The name "seesaw" comes from its original behavior: download, upload, and repeat.
Projects
Projects are in separate repositories, typically with the name -grab as a suffix. Item lists that are loaded into the tracker are sometimes saved into a repo with -items as a suffix. Scripts to build searchable index HTML pages are usually suffixed with -index.
Server code
Server code includes code that the Tracker executes.
universal-tracker - Ruby
- The server that the Seesaw client contacts
warrior-hq - Ruby
- The server that the warrior appliances contact for project metadata
archiveteam-megawarc-factory - shell
- The scripts that bundle the WARC files.
URLTeam code
URLTeam code is independent from the tracker and warrior.
Old:
- The client code that scrapes the shortlinks. It includes a pipeline shim to run the code.
- The server code for the tracker.
New:
- A pipeline shim to run the code.
- The code for both the client library and tracker.
Misc
- warrior-dockerfile - Dockerfile that runs the warrior inside a Docker container.
ArchiveBot - Ruby, Python, Lua
- An IRC bot for archiving websites.
wget-lua - C, Lua
- A patched version of Wget for web crawling.
standalone-readme-template - Markdown
- A template for readme files included in grab repositories.
archiveteam-dev-env - Shell
- Ubuntu preseed for a developer environment for ArchiveTeam projects.
wpull - Python
- A Wget-compatible web downloader/crawler.
Warrior overview
The Warrior is a virtual machine appliance used by volunteers to participate in projects.
Packages
The Warrior image is built off Alpine Linux 3.6.2:
- kernel 4.9.32
- the virtual machine image is prepared using the stage.sh script and contains a pre-installed /root/boot.sh script that downloads and boots the warrior.
The warrior itself runs in a docker container running Ubuntu 16.04 that contains
- Python 3.5.2, pip 8.1.1
- Perl v5.22.1
- gcc 5.4.0, make 4.1, bash 4.3.48
- curl 7.47.0
Bootup
The virtual machine is self-updating. It does the following:
- Start the virtual machine
- Linux boots
- boot.sh downloads and launches /root/startup.sh
- startup.sh prepares and runs a docker container with the warrior runner
- Point your web browser to http://localhost:8001 and go.
Logging into the Warrior
To log into the warrior,
- Press Alt+F3 (or press Alt+Right).
- The username is root and the password is archiveteam.
- You are now logged in as root.
- Check the docker container with docker ps. This will give you the docker container identifier, among other details.
- Enter the docker container with docker exec -it identifier /bin/bash
Testing Core Warrior Code
Since the Warrior pulls from GitHub, it is important to commit only stable changes into the master branch. Recommended Git branching practices use a development branch.
To test core Warrior code, you can switch from the master branch to the development branch. The Warrior will fetch the corresponding seesaw-kit repository branch.
To change branches,
- Log in as root
- Execute cd /home/warrior/warrior-code2
- Execute sudo -u warrior git checkout development
- Execute reboot

By the same route you can return your warrior to the master branch.
The code for each project is stored in /home/warrior/projects/<PROJECTNAME>/
Starting a new project
Starting a new project is a giant leap into getting things done.
Website Structure
Take a good look at how the website is structured:
- Is everything hosted under one domain name?
- Is there a throttling system?
- How can I discover usernames or page IDs?
- Is there an API?
- Is there a sitemap.xml?
- Can I guess URLs by incrementing a value?
- Does disabling cookies or using specific cookies affect anything?
- Does the website break if you make special requests?
- Can you Google site:example.com for some URLs?
  - Hint: site:example.com inurl:show_thread
- Is it a video? Try get-flash-videos
JavaScript
JavaScript is a pain.
- Check to see if there's a noscript or mobile version.
- Use a web inspector to observe its behavior and simulate POST requests made by the scripts.
- Scrape URLs from JavaScript templates with regular expressions.
Static Assets
Websites sometimes do not host static media such as images and stylesheets under their primary domain name. Be sure to take those into consideration.
IP Address Bans & Throttling
Find out if there is IP address banning. Use a sacrificial IP address if you need to.
Items
- See also: Dev/Seesaw#Quick Definitions
Once you determine the website structure, you need to decide how to split the work into units efficiently, each identified by an item name. An item name is a short string describing the work unit, for example, a username.
Because the Tracker uses Redis as its database, memory usage is a concern. The maximum number of items supported ranges from 5,000,000 to 10,000,000 depending on the item name length.
- If a user site is USERNAME.example.com, a good candidate is USERNAME.
- Be careful of large subdomain sites.
- If the content is keyed by a numerical ID, consider whether ranges of IDs are appropriate (see the sketch below).
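A range-based item name keeps the item count, and thus the Tracker's Redis memory usage, manageable. Here is a minimal Python sketch; the helper names, the 1,000-IDs-per-item granularity, and the example URL pattern are invented for illustration:

def make_item_names(max_id, ids_per_item=1000):
    """Yield compact item names like '0-999', '1000-1999', ..."""
    for start in range(0, max_id + 1, ids_per_item):
        end = min(start + ids_per_item - 1, max_id)
        yield "%d-%d" % (start, end)

def expand_item(item_name):
    """A grab script would expand an item name back into URLs.
    example.com and the /photo/ path are placeholders."""
    start, end = (int(part) for part in item_name.split("-"))
    return ["http://example.com/photo/%d" % n for n in range(start, end + 1)]

With this scheme, 10,000,000 IDs become only 10,000 short item names in Redis.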
Call for Action
- ProTip™: Get things done.
Wiki Page
Ensure there is documentation on this wiki about the project.
Include:
- an overview of the website
- the shutdown notice
- "how to help" instructions
- a (future) link to the archives
Writing Grab Scripts
If you do not have permission to create a repository under Archive Team's organization, please ask on IRC.
For detailed information about what goes inside grab scripts, take a look at writing Seesaw scripts.
Tracker Access
If you do not have permission to access the Tracker, please see Tracker#People.
IRC Channel
Archive Team uses per-project IRC channels to reduce noise in the main channel. It also serves as a technical support channel.
IRC channel names must be humorous.
- If an employee of the website in danger appears on the channel, please do cooperate.
Project Management
Successful projects are a result of successful management. See Project Management for details.
Getting Attention
Many Twitter followers? Got connections? Become a loudmouth!
Otherwise, take initiative yourself and encourage other team members to take initiative.
Writing Seesaw grab scripts
Writing a Seesaw content grab script is the most challenging and fun aspect of the infrastructure.
What an Archive Team Project Contains
Once the Git repository has been created, be sure to include the following files:
pipeline.py
- This file contains the Seesaw client code for the project.
README.{md,rst,txt}
- This file contains:
  - brief information about the project
  - instructions on how to manually run the scripts
- A template is available here: standalone-readme-template
[Project Name Here].lua (optional)
- This is the Lua script used by Wget-Lua.
warrior-install.sh (optional)
- This file is executed by the Warrior to install extra libraries needed by the project. Example: punchfork-grab warrior-install.sh.
wget-lua-warrior (optional)
- This executable is a build of Wget-Lua for the warrior environment.
get-wget-lua.sh (optional)
- Build scripts for Wget-Lua for those running scripts manually.
The repository is pulled in by the Warrior or cloned by those who want to run the scripts manually.
Writing a pipeline.py (Seesaw Client)
The Seesaw client is a specific set of tasks that must be done for each item. Think of it as a template of instructions. Typically, the file is called pipeline.py. The pipeline file uses the Seesaw Library.
The pipeline file will typically use Wget with Lua scripting. Wget+Lua is a web crawler.
The Lua script is provided as an argument to Wget within the pipeline file. It controls fine-grained operations within Wget, such as rejecting unneeded URLs or adding more URLs as they are discovered.
The goal of the pipeline is to download, make WARC files, and upload them. A toy sketch follows the definitions below.
Quick Definitions
item
- a work unit
pipeline
- a series of tasks in an item
task
- a step in getting the item done
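To make these terms concrete, here is a toy pipeline file. This is only a sketch: it assumes seesaw is installed, invents the task name, and leaves out everything a real grab script needs (see the full example below).

from seesaw.pipeline import Pipeline
from seesaw.project import Project
from seesaw.task import SimpleTask

class SayHello(SimpleTask):
    # a task: one step that every item passes through
    def __init__(self):
        SimpleTask.__init__(self, "SayHello")

    def process(self, item):
        item.log_output("Hello from a toy task!")

# a pipeline.py conventionally defines both of these module-level names;
# real scripts pass more arguments to Project and chain many tasks
project = Project(title="Toy project")
pipeline = Pipeline(SayHello())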
Installation
You will need:
- Python 2.6/2.7
- Lua
- Wget with Lua hooks
Typically, you can install these on Ubuntu by running:
sudo apt-get install build-essential lua5.1 liblua5.1-0-dev python python-setuptools python-dev openssl libssl-dev python-pip make libgnutls-dev zlib1g-dev
sudo pip install seesaw
You will also need Wget with Lua. There is an Ubuntu PPA or you can build it yourself:
./get-wget-lua.sh
Grab a recent build script from here.
The pipeline file
The pipeline file typically includes:
- A line that checks the minimum seesaw version required
- Copy-and-pasted monkey patches if needed
- A routine to find Wget-Lua
- A version number in the form of YYYYMMDD.NN
- Misc constants
- Custom Tasks:
  - PrepareDirectories
  - MoveFiles
- Project information saved into the project variable
- Instructions on how to deal with the item saved into the pipeline variable
- An undeclared downloader variable which will be filled in by the Seesaw library
It is important to remember that each Task is a template for how to deal with each Item. Specific item variables should not be stored on a Task but rather saved onto the item: item["my_data"] = "hello".
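For example, here is a minimal sketch (the task and key names are invented) of keeping per-item state in the right place:

from seesaw.task import SimpleTask

class CountLinks(SimpleTask):
    def __init__(self):
        SimpleTask.__init__(self, "CountLinks")
        # Do NOT keep per-item state here: this one task instance is shared
        # by every item that flows through the pipeline.

    def process(self, item):
        # Per-item values belong on the item itself:
        item["link_count"] = 42
        item.log_output("Found %d links" % item["link_count"])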
Minimum Seesaw Version Check
if StrictVersion(seesaw.__version__) < StrictVersion("0.0.15"):
    raise Exception("This pipeline needs seesaw version 0.0.15 or higher.")
This check is used to prevent manual script users from using an obsolete version of Seesaw. The Warrior will always upgrade to the latest Seesaw if dictated in the Tracker's projects.json file.
Version 0.0.15 is the supported legacy version, but it is suggested to rely on the latest version of Seesaw as specified in the Seesaw Python Package Index.
Monkey Patches
Monkey patches such as AsyncPopenFixed are only provided for legacy versions of Seesaw.
Routine to find Wget-Lua
WGET_LUA = find_executable(
    "Wget+Lua",
    ["GNU Wget 1.14.lua.20130523-9a5c"],
    [
        "./wget-lua",
        "./wget-lua-warrior",
        "./wget-lua-local",
        "../wget-lua",
        "../../wget-lua",
        "/home/warrior/wget-lua",
        "/usr/bin/wget-lua"
    ]
)

if not WGET_LUA:
    raise Exception("No usable Wget+Lua found.")
This routine is a sanity check that aborts the script early if Wget+Lua has not been found. Omit this if needed.
Script Version
VERSION = "20131129.00"
This constant, to be used within pipeline, is sent to the Tracker and should be embedded within the WARC files. It is used for accounting purposes:
- Tracker admins can check the logs for faulty grab scripts and requeue the faulty items.
- Tracker admins can require the user to upgrade the scripts.

Always change the version whenever you make a non-cosmetic change. Note that this constant is only a variable: it has no effect unless it is actually used within pipeline.
Misc constants
USER_AGENT = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27 ArchiveTeam"
TRACKER_ID = "posterous"
TRACKER_HOST = "tracker.archiveteam.org"
Constants like USER_AGENT and TRACKER_HOST keep the configuration in one obvious place and the rest of the code clean.
Check IP address
This task checks the IP address to ensure the user is not behind a proxy or firewall. Sometimes websites are censored, or the user is behind a captive portal (like a coffeeshop wifi), which would ruin the results.
class CheckIP(SimpleTask):
    def __init__(self):
        SimpleTask.__init__(self, "CheckIP")
        self._counter = 0

    def process(self, item):
        # Check only occasionally
        if self._counter <= 0:
            ip_str = socket.gethostbyname('example.com')

            if ip_str not in ['1.2.3.4', '1.2.3.6']:
                item.log_output('Got IP address: %s' % ip_str)
                item.log_output(
                    'Are you behind a firewall/proxy? That is a big no-no!')
                raise Exception(
                    'Are you behind a firewall/proxy? That is a big no-no!')

            self._counter = 10
        else:
            self._counter -= 1
PrepareDirectories & MoveFiles
class PrepareDirectories(SimpleTask):
    """
    A task that creates temporary directories and initializes filenames.

    It initializes these directories, based on the previously set item_name:
        item["item_dir"] = "%{data_dir}/%{item_name}"
        item["warc_file_base"] = "%{warc_prefix}-%{item_name}-%{timestamp}"

    These attributes are used in the following tasks, e.g., the Wget call.

    * set warc_prefix to the project name.
    * item["data_dir"] is set by the environment: it points to a working
      directory reserved for this item.
    * use item["item_dir"] for temporary files
    """
    def __init__(self, warc_prefix):
        SimpleTask.__init__(self, "PrepareDirectories")
        self.warc_prefix = warc_prefix

    def process(self, item):
        item_name = item["item_name"]
        dirname = "/".join((item["data_dir"], item_name))

        if os.path.isdir(dirname):
            shutil.rmtree(dirname)
        os.makedirs(dirname)

        item["item_dir"] = dirname
        item["warc_file_base"] = "%s-%s-%s" % (self.warc_prefix, item_name,
            time.strftime("%Y%m%d-%H%M%S"))

        open("%(item_dir)s/%(warc_file_base)s.warc.gz" % item, "w").close()


class MoveFiles(SimpleTask):
    """
    After downloading, this task moves the warc file from the
    item["item_dir"] directory to the item["data_dir"], and removes
    the files in the item["item_dir"] directory.
    """
    def __init__(self):
        SimpleTask.__init__(self, "MoveFiles")

    def process(self, item):
        os.rename("%(item_dir)s/%(warc_file_base)s.warc.gz" % item,
                  "%(data_dir)s/%(warc_file_base)s.warc.gz" % item)
        shutil.rmtree("%(item_dir)s" % item)
These tasks are "tradition" (meaning, they are copied-and-pasted and modified to fit) for managing temporary files.
Note, PrepareDirectories makes an empty warc.gz file since later tasks expect a warc.gz file.
project variable
project = Project(
    title = "Posterous",
    project_html = """
        <img class="project-logo" alt="Posterous Logo" src="http://archiveteam.org/images/6/6c/Posterous_logo.png" height="50"/>
        <h2>Posterous.com <span class="links">
            <a href="http://www.posterous.com/">Website</a> ·
            <a href="http://tracker.archiveteam.org/posterous/">Leaderboard</a>
        </span></h2>
        <p><i>Posterous</i> is closing April, 30th, 2013</p>
    """,
    utc_deadline = datetime.datetime(2013, 4, 30, 23, 59, 0)
)
This variable is used within the Warrior to show the HTML at the top of the page.
Note, this could potentially be used to show important messages using <p class="projectBroadcastMessage"></p>. However, manual script users will not see anything related to this variable, so you may want to print out any important messages instead.
pipeline variable
Here's a real chunk of code.
pipeline = Pipeline(
    # request an item from the tracker (using the universal-tracker protocol)
    # the downloader variable will be set by the warrior environment
    #
    # this task will wait for an item and sets item["item_name"] to the item
    # name before finishing
    GetItemFromTracker("http://%s/%s" % (TRACKER_HOST, TRACKER_ID),
        downloader, VERSION),

    # create the directories and initialize the filenames (see above)
    # warc_prefix is the first part of the warc filename
    #
    # this task will set item["item_dir"] and item["warc_file_base"]
    PrepareDirectories(warc_prefix="posterous.com"),

    # execute Wget+Lua
    #
    # the ItemInterpolation() objects are resolved during runtime
    # (when there is an Item with values that can be added to the strings)
    WgetDownload([
        WGET_LUA,
        "-U", USER_AGENT,
        "-nv",
        "-o", ItemInterpolation("%(item_dir)s/wget.log"),
        "--no-check-certificate",
        "--output-document", ItemInterpolation("%(item_dir)s/wget.tmp"),
        "--truncate-output",
        "-e", "robots=off",
        "--rotate-dns",
        "--recursive", "--level=inf",
        "--page-requisites",
        "--span-hosts",
        "--domains", ItemInterpolation(
            "%(item_name)s,s3.amazonaws.com,files.posterous.com,"
            "getfile.posterous.com,getfile0.posterous.com,getfile1.posterous.com,"
            "getfile2.posterous.com,getfile3.posterous.com,getfile4.posterous.com,"
            "getfile5.posterous.com,getfile6.posterous.com,getfile7.posterous.com,"
            "getfile8.posterous.com,getfile9.posterous.com,getfile10.posterous.com"),
        "--reject-regex", r"\.com/login",
        "--timeout", "60",
        "--tries", "20",
        "--waitretry", "5",
        "--lua-script", "posterous.lua",
        "--warc-file", ItemInterpolation("%(item_dir)s/%(warc_file_base)s"),
        "--warc-header", "operator: Archive Team",
        "--warc-header", "posterous-dld-script-version: " + VERSION,
        "--warc-header", ItemInterpolation("posterous-user: %(item_name)s"),
        ItemInterpolation("http://%(item_name)s/")
    ],
    max_tries = 2,
    # check this: which Wget exit codes count as a success?
    accept_on_exit_code = [0, 8],
    ),

    # this will set the item["stats"] string that is sent to the tracker
    # (see below)
    PrepareStatsForTracker(
        # there are a few normal values that need to be sent
        defaults = {"downloader": downloader, "version": VERSION},
        # this is used for the size counter on the tracker:
        # the groups should correspond with the groups set configured on the tracker
        file_groups = {
            # there can be multiple groups with multiple files
            # file sizes are measured per group
            "data": [ItemInterpolation("%(item_dir)s/%(warc_file_base)s.warc.gz")]
        },
    ),

    # remove the temporary files, move the warc file from
    # item["item_dir"] to item["data_dir"]
    MoveFiles(),

    # there can be multiple items in the pipeline, but this wrapper ensures
    # that there is only one item uploading at a time
    #
    # the NumberConfigValue can be changed in the configuration panel
    LimitConcurrent(
        NumberConfigValue(min=1, max=4, default="1",
            name="shared:rsync_threads",
            title="Rsync threads",
            description="The maximum number of concurrent uploads."),
        # this upload task asks the tracker for an upload target
        # this can be HTTP or rsync and can be changed in the tracker admin panel
        UploadWithTracker(
            "http://%s/%s" % (TRACKER_HOST, TRACKER_ID),
            downloader = downloader,
            version = VERSION,
            # list the files that should be uploaded.
            # this may include directory names.
            # note: HTTP uploads will only upload the first file on this list
            files = [
                ItemInterpolation("%(data_dir)s/%(warc_file_base)s.warc.gz")
            ],
            # the relative path for the rsync command
            # (this defines if the files are uploaded to a subdirectory on the server)
            rsync_target_source_path = ItemInterpolation("%(data_dir)s/"),
            # extra rsync parameters (probably standard)
            rsync_extra_args = [
                "--recursive",
                "--partial",
                "--partial-dir", ".rsync-tmp"
            ]
        ),
    ),

    # if the item passed every task, notify the tracker and report the statistics
    SendDoneToTracker(
        tracker_url = "http://%s/%s" % (TRACKER_HOST, TRACKER_ID),
        stats = ItemValue("stats")
    )
)
It's pretty big.
Notice:
- the downloader variable should be left undefined
- ItemInterpolation holds some magic: ItemInterpolation("%(item_dir)s/wget.log").realize(item) executes "%(item_dir)s/wget.log" % item, which gives us item["item_dir"] + "/wget.log" (see the snippet below)
- --output-document concatenates everything into a single temporary file.
- --truncate-output is a Wget+Lua option. It turns --output-document into a temporary file option: Wget downloads to the file, extracts the URLs, and then truncates the file to 0 bytes.
- the use of -e robots=off because robots.txt is bad
- --lua-script posterous.lua specifies the Lua script that controls Wget
- NumberConfigValue adds another setting to the Warrior's advanced settings page
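To see that interpolation in action, here is a small sketch. A plain dict stands in for a seesaw Item here, on the assumption that ItemInterpolation only needs %-style key lookup; real pipelines receive an Item object:

from seesaw.item import ItemInterpolation

fake_item = {"item_dir": "/tmp/data/someuser"}  # stand-in for a seesaw Item
log_path = ItemInterpolation("%(item_dir)s/wget.log").realize(fake_item)
print(log_path)  # /tmp/data/someuser/wget.log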
Lua Script
The Lua script is like a parasite controlling and modifying Wget's behavior from within.
- Recommended reading: Wget with Lua hooks
- Example listings: Wget with Lua hooks#More_Examples
- Reference documentation: Wget with Lua hooks (GitHub)
Generally, scripts will want to use:
- download_child_p
- httploop_result
- get_urls
download_child_p
This hook is useful for advanced URL accepting and rejecting. Although Wget supports regular expressions in its command-line options, that can get messy. Lua supports only a small regular-expression-like subset called Patterns.
httploop_result
This hook is useful for checking if we have been banned or for implementing our own --wait.
Here is a practical example that delays Wget for a minute on a ban or server overload, waits approximately 1 second between normal requests, and applies no delay on a content delivery network:
wget.callbacks.httploop_result = function(url, err, http_stat)
  local sleep_time = 60
  local status_code = http_stat["statcode"]

  if status_code == 420 or status_code >= 500 then
    if status_code == 420 then
      io.stdout:write("\nBanned (code "..http_stat.statcode.."). Sleeping for "..sleep_time.." seconds.\n")
    else
      io.stdout:write("\nServer angered! (code "..http_stat.statcode.."). Sleeping for "..sleep_time.." seconds.\n")
    end
    io.stdout:flush()

    -- Execute the UNIX sleep command (since Lua does not have its own delay function)
    -- Note that wget has its own linear backoff to this time as well
    os.execute("sleep " .. sleep_time)

    -- Tells wget to try again
    return wget.actions.CONTINUE
  else
    -- We're okay; sleep a bit (if we have to) and continue
    local sleep_time = 1.0 * (math.random(75, 125) / 100.0)

    if string.match(url["url"], "website-cdn%.net") then
      -- We should be able to go fast on images since that's what a web browser does
      sleep_time = 0
    end

    if sleep_time > 0.001 then
      os.execute("sleep " .. sleep_time)
    end

    -- Tells wget to resume normal behavior
    return wget.actions.NOTHING
  end
end
- You will likely want to be cautious and return the wget.actions.CONTINUE action to cover a wide range of cases. Wget may consider a temporary server overload to be a permanent error.
- Yahoo! likes to use status 999 to indicate a temporary ban.
get_urls
This hook is used to add additional URLs.
This example injects URLs to simulate JavaScript requests:
wget.callbacks.get_urls = function(file, url, is_css, iri)
  local urls = {}

  -- read the downloaded document so we can scrape image IDs out of it
  local f = io.open(file)
  local html = f:read("*a")
  f:close()

  for image_id in string.gmatch(html, "([a-zA-Z0-9]-)/image_thumb%.png") do
    table.insert(urls, {
      url="http://example.com/photo_viewer.php?imageid="..image_id,
      post_data="crf_token=deadbeef"
    })
  end

  return urls
end
It can also be used to display a progress message:
url_count = 0

wget.callbacks.get_urls = function(file, url, is_css, iri)
  url_count = url_count + 1
  if url_count % 5 == 0 then
    io.stdout:write("\r - Downloaded "..url_count.." URLs.")
    io.stdout:flush()
  end
end
Useful Snippets
Read the first 4 kilobytes of a file:
read_file_short = function(file)
  if file then
    local f = io.open(file)
    local data = f:read(4096)
    f:close()
    return data or ""
  else
    return ""
  end
end
Run a pipeline.py
To run a pipeline file, run the command:
run-pipeline pipeline.py YOUR_NICKNAME
For more options, run:
run-pipeline --help
External Links
- Take a look at the grab scripts in recent Archive Team repositories for examples of clients.
- For more information, consult the seesaw-kit wiki.
Setting up a tracker
This article describes how to set up your own tracker just like the official Archive Team tracker. Use this guide only if you want to do a full test of the infrastructure.
Note: A virtual machine appliance is available at ArchiveTeam/archiveteam-dev-env which contains a ready-to-use tracker. A docker container is also available.
Installation will cover:
- Environment: Ubuntu/Debian
- Languages:
- Python
- Ruby
- JavaScript
- Web:
- Nginx
- Phusion Passenger
- Redis
- Node.js
- Tools:
- Screen
- Rsync
- Git
- Wget
- regular expressions
The Tracker
The Tracker manages what items are claimed by users that run the Seesaw client. It also shows a pretty leaderboard.
Let's create a dedicated account to run the web server and tracker:
sudo adduser --system --group --shell /bin/bash tracker
Redis
Redis is a database stored in memory, so item names should be engineered to be memory efficient. Redis saves its database periodically into a file located at /var/lib/redis/6379/dump.rdb. It is safe to copy the file, e.g., for backups.
To install Redis, you may follow these quickstart instructions, but we'll show you how.
These steps are from the quickstart guide:
wget http://download.redis.io/redis-stable.tar.gz
tar xvzf redis-stable.tar.gz
cd redis-stable
make
Now install the server:
sudo make install
cd utils
sudo ./install_server.sh
Note, by default, it runs as root. Let's stop it and make it run under www-data:
sudo invoke-rc.d redis_6379 stop
sudo adduser --system --group www-data
sudo chown -R www-data:www-data /var/lib/redis/6379/
sudo chown -R www-data:www-data /var/log/redis_6379.log
Edit the config file /etc/redis/6379.conf with options like:
bind 127.0.0.1
pidfile /var/run/shm/redis_6379.pid
Now tell the start up script to run it as www-data:
sudo nano /etc/init.d/redis_6379
Change the EXEC and CLIEXEC variables to use sudo -u www-data -g www-data:
EXEC="sudo -u www-data -g www-data /usr/local/bin/redis-server"
CLIEXEC="sudo -u www-data -g www-data /usr/local/bin/redis-cli"
PIDFILE=/var/run/shm/redis_6379.pid
To avoid catastrophe with background saves failing on fork() (Redis needs lots of memory), run:
sudo sysctl vm.overcommit_memory=1
The above setting will be lost after reboot. To make it permanent, add this line to /etc/sysctl.conf:
vm.overcommit_memory=1
The log file will get big, so we need a logrotate config. Create one at /etc/logrotate.d/redis with the config:
/var/log/redis_*.log {
    daily
    rotate 10
    copytruncate
    delaycompress
    compress
    notifempty
    missingok
    size 10M
}
Start up Redis again using:
sudo invoke-rc.d redis_6379 start
Nginx with Passenger
Nginx is a web server. Phusion Passenger is a module within Nginx that runs Rails applications.
There is a guide on how to install Nginx with Passenger; the following instructions are similar.
Log in as tracker:
sudo -u tracker -i
We'll use RVM to install Ruby libraries:
curl -L get.rvm.io | bash -s stable
source ~/.rvm/scripts/rvm
rvm requirements
A list of things that need to be installed will be shown. Log out of the tracker account, install them, and log back into the tracker account.
Install Ruby and Bundler:
rvm install 2.2.2
rvm rubygems current
gem install bundler
Install Passenger:
gem install passenger
Install Nginx. This command will download, compile, and install a basic Nginx server:
passenger-install-nginx-module
Use the following prefix for Nginx installation:
/home/tracker/nginx/
Change the location of the tracker software (to be installed later). Edit nginx/conf/nginx.conf. Use the lines under the "location /" option:
root /home/tracker/universal-tracker/public;
passenger_enabled on;
client_max_body_size 15M;
The logs will get big so we'll use logrotate. Save this into /home/tracker/logrotate.conf:
/home/tracker/nginx/logs/error.log /home/tracker/nginx/logs/access.log {
    daily
    rotate 10
    copytruncate
    delaycompress
    compress
    notifempty
    missingok
    size 10M
}
To call logrotate, we'll add an entry using crontab:
crontab -e
Now add the following line:
@daily /usr/sbin/logrotate --state /home/tracker/.logrotate.state /home/tracker/logrotate.conf
Log out of the tracker account at this point.
Let's create an Upstart configuration file to start up Nginx. Save this into /etc/init/nginx-tracker.conf:
description "nginx http daemon" start on runlevel [2] stop on runlevel [016] setuid tracker setgid tracker console output exec /home/tracker/nginx/sbin/nginx -c /home/tracker/nginx/conf/nginx.conf -g "daemon off;"
Or, if you use Systemd, put this into /lib/systemd/system/nginx-tracker.service:
[Unit] Description="nginx http daemon" [Service] Type=simple ExecStart=/home/tracker/nginx/sbin/nginx -c /home/tracker/nginx/conf/nginx.conf -g "daemon off;"
Tracker
Log into the tracker account.
Download the Tracker software:
git clone https://github.com/ArchiveTeam/universal-tracker.git
We'll need to configure the location of Redis. Copy the config file:
cp universal-tracker/config/redis.json.example universal-tracker/config/redis.json
Add a "production" object into the JSON file. Here is an example:
{ "development": { "host": "127.0.0.1", "port": 6379, "db": 13 }, "test": { "host": "127.0.0.1", "port": 6379, "db": 14 }, "production": { "host":"127.0.0.1", "port":6379, "db": 1 } }
- Now we may need to fix an issue with Passenger forking after the Redis connection has been made. Please see https://github.com/ArchiveTeam/universal-tracker/issues/5 for more information.
- There is also an issue with non-ASCII names. See https://github.com/ArchiveTeam/universal-tracker/issues/7.
Now install the necessary gems:
cd universal-tracker
bundle install
Log out of the tracker account at this point.
Node.js
Node.js is required to run the fancy leaderboard using WebSockets. We'll use NPM to manage the Node.js libraries:
sudo apt-get install npm
Log into the tracker account.
Now, we manually edit the Node.js program because it has problems:
cp -R universal-tracker/broadcaster .
nano broadcaster/server.js
Modify the env and trackerConfig variables to something like this:
var env = {
  tracker_config: {
    redis_pubsub_channel: "tracker-log"
  },
  redis_db: 1
};
var trackerConfig = env['tracker_config'];
You also need to modify the "transports" configuration by adding websocket. The new line should look like this:
io.set("transports", ["websocket", "xhr-polling"]);
Install the Node.js libraries needed:
npm install
If you get an error while installing hiredis, you may need to provide Debian's "nodejs" as "node". Symlink "node" to the nodejs executable and try again.
Log out of the tracker account at this point.
Create an Upstart file at /etc/init/nodejs-tracker.conf:
description "tracker nodejs daemon" start on runlevel [2] stop on runlevel [016] setuid tracker setgid tracker exec node /home/tracker/broadcaster/server.js
Or, for Systemd, put this into /lib/systemd/system/nodejs-tracker.service:
[Unit] Description="tracker nodejs daemon" [Service] Type=forking Group=tracker User=tracker ExecStart=/usr/bin/js /home/tracker/broadcaster/server.js
Tracker Setup
Start up the Tracker and Broadcaster:
Upstart:
sudo start nginx-tracker
sudo start nodejs-tracker
Systemd:
sudo systemctl start nginx-tracker
sudo systemctl start nodejs-tracker
You now need to configure the tracker. Open up your web browser and visit http://localhost/global-admin/.
- In Global-Admin→Configuration→Live logging host, specify the public location of the Node.js app. By default, it uses port 8080.
You are now free to manage the tracker.
Notes:
- If you followed this guide, the rsync location is defined as rsync://HOSTNAME/PROJECT_NAME/:downloader/
- The trailing slash within the rsync URL is very important. Without it, files will not be uploaded into the directory.
Claims
You probably want Cron to clear out old claims. The Tracker includes a Ruby script that will do that for you. By default, it removes claims older than 6 hours. You may want to change that for big items by creating a copy of the script for each project.
To set up Cron, login as the tracker account, and run:
which ruby
Take note of which Ruby executable is used.
Now edit the Cron table:
crontab -e
Add the following line, which runs release-stale.rb every 6 hours:
0 */6 * * * cd /home/tracker/universal-tracker && WHICH_RUBY scripts/release-stale.rb PROJECT_NAME
Logs
Since the Tracker stores logs into Redis, it will use up memory quickly. log-drainer.rb continuously writes the logs out to a text file:
mkdir -p /home/tracker/universal-tracker/logs/
cd /home/tracker/universal-tracker && ruby scripts/log-drainer.rb
Pressing CTRL+C will stop it. Run this within a Screen session.
This crontab entry will compress the log files that haven't been modified in two days:
@daily find /home/tracker/universal-tracker/logs/ -iname "*.log" -mtime +2 -exec xz {} \;
Reducing memory usage
The Passenger Ruby module may use up too much memory. You can add the following lines to your nginx config, inside the http block:
passenger_max_pool_size 2;
passenger_max_requests 10000;
The first line allows spawning a maximum of 2 processes. The second line restarts Passenger after 10,000 requests to free memory lost to memory leaks.