Tracker

Project admin overview

The tracker software is the hub of Archiveteam's distributed archiving efforts. It hands out items to be downloaded and keeps track of which ones are complete. An item can be a username, a subdomain, a full URL, or any other unit that breaks a site into manageable chunks. The progress of each project can be viewed via the leaderboard interface on https://tracker.archiveteam.org.

A leaderboard

The Warrior is the yang to the Tracker's yin. The warriors get the list of current projects from the project file on https://warriorhq.archiveteam.org/.

Using the proprietary tracker

Userscript

A custom (but unofficial) userscript, which can be installed using Tampermonkey or other userscript managers, can be used to replace the tracker leaderboard JS to enable the display of additional information including b/s and i/s values as well as time estimates.

Rate limit, backfeed, queues, etc.

Since some time around 2020 the tracker has supported "backfeed", where there is an HTTP endpoint grab scripts can use to queue additional items to the tracker. For any given item, backfeed's bloom filter has roughly a one in a million chance of falsely detecting it as a duplicate (and therefore not queuing it).
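To get a feel for what a one-in-a-million false positive rate costs, the textbook Bloom filter sizing formulas can be evaluated directly. This is only a sketch of the standard math, not the tracker's actual configuration; the function name and the example item count are made up for illustration.

```python
import math

def bloom_parameters(n_items: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for an optimally sized Bloom filter."""
    # Optimal size in bits: m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    # Optimal number of hash functions: k = (m / n) * ln 2
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes

# Hypothetical project with 100 million backfed items.
bits, hashes = bloom_parameters(100_000_000, 1e-6)
print(f"{bits / 100_000_000:.1f} bits per item, {hashes} hash functions")
```

Under these assumptions the filter needs a little under 29 bits per item, so even very large projects can be deduplicated in memory, at the price of occasionally dropping a genuinely new item.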

The tracker has several "queues", which correspond to Redis data structures of some kind. These include the main todo, todo:backfeed, todo:secondary, etc. To limit Redis' memory usage, parts of the queue, including completed items, are moved to disk ("offloaded") in larger projects using a set of Redis scripts. Offloaded items are hard to work with for further processing. Claims are always kept in memory because the tracker needs to be able to work with them, so if too many items are claimed, it may cause issues; for this reason, projects with many items usually have a claims limit (see below) set.

On massive projects like Telegram and URLs, completed items are discarded from the tracker entirely due to the sheer number of items. However, these items can still be retrieved from elsewhere if really necessary.

As the warrior has no mechanism to report failed items, we deal with these by "reclaiming" them after they have been in out/claims for a while. A claimed item becomes eligible for reclaiming once its age reaches the TTL (time to live) multiplied by the number of times the item has been claimed. Reclaiming is disabled by default on new projects. Reclaiming does not change any of the counts: the item cycles immediately from out/claims back into out/claims.
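The eligibility rule can be written as a one-line check. This is a minimal sketch of the rule as described here, not the tracker's code; the function name, the use of timestamps in seconds, and the convention that a non-positive TTL means "disabled" are all assumptions.

```python
def is_reclaimable(claimed_at: float, times_claimed: int, ttl: float, now: float) -> bool:
    """Return True if a claimed item is old enough to be handed out again."""
    if ttl <= 0:
        # Assumption: a non-positive TTL stands for "reclaiming disabled".
        return False
    # The minimum age grows with every claim: TTL * number of claims.
    return now - claimed_at >= ttl * times_claimed

# With a one-hour TTL, a first claim can be reclaimed after one hour,
# but an item on its second claim has to wait two hours.
print(is_reclaimable(claimed_at=0, times_claimed=1, ttl=3600, now=3600))
print(is_reclaimable(claimed_at=0, times_claimed=2, ttl=3600, now=3600))
```

Scaling the wait by the claim count means items that keep failing are retried less and less often.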

The order of priority for a warrior to get an item when it requests it is:

  • reclaims if there is a "claims limit" (a per-project number that caps the total size of "out") and it has been hit
  • reclaims if there is a per-item-type rate limit capping each item type (it is unclear what the specifics of this are)
  • todo (main queue)
  • todo:backfeed (items discovered by workers processing other items)
  • todo:secondary (typically items of lower priority)
  • todo:redo (typically items that have been claimed but took too long to be returned)
  • reclaims

Note that some projects give the four todo: queues different meanings than listed here.
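The priority order above can be sketched as a small selection function. It is an illustration only: the function and parameter names are hypothetical, the per-item-type rate limit step is left out because its specifics are unclear, and returning nothing when the claims limit is hit and no reclaim is available is an assumption.

```python
TODO_QUEUES = ["todo", "todo:backfeed", "todo:secondary", "todo:redo"]

def next_source(queue_sizes, out_size, claims_limit=None, reclaimable=0):
    """Return the name of the source the next item would come from, or None."""
    # Claims limit reached: only reclaims may be served (assumption: no new
    # items are handed out until "out" shrinks again).
    if claims_limit is not None and out_size >= claims_limit:
        return "reclaims" if reclaimable > 0 else None
    # The per-item-type rate limit step would go here; it is omitted.
    for name in TODO_QUEUES:
        if queue_sizes.get(name, 0) > 0:
            return name
    # All todo queues are empty: fall back to reclaims.
    return "reclaims" if reclaimable > 0 else None

print(next_source({"todo": 0, "todo:backfeed": 3}, out_size=10))
```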

The tracker supports pattern limits, where items that match a specific pattern are collectively rate-limited. However, this is computationally expensive and, because of how the tracker selects items, can result in a low item request serve rate (see below). Specifically, the tracker goes to each queue, tries to claim up to N items at random, and then filters out those that are rate-limited. If it has not claimed enough items yet, it moves on to the next queue. So if most of a queue is rate-limited, the tracker may not be able to claim anything at all.
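The sample-then-filter behaviour described above can be illustrated with a short simulation. It is a simplified model, not the tracker's implementation: queues are plain Python lists rather than Redis structures, the pattern limit is treated as fully exhausted (every matching item is rejected), and all names are hypothetical.

```python
import random
import re

def claim_items(queues, wanted, limited_pattern, rng):
    """Try to claim `wanted` items from `queues` (dict of name -> list, in priority order)."""
    claimed = []
    for name, items in queues.items():
        need = wanted - len(claimed)
        if need <= 0:
            break
        # Pick up to `need` random candidates from this queue ...
        sample = rng.sample(items, min(need, len(items)))
        # ... and only afterwards drop the ones that are rate-limited.
        allowed = [item for item in sample if not limited_pattern.search(item)]
        for item in allowed:
            items.remove(item)
        claimed.extend(allowed)
    return claimed

# A queue dominated by rate-limited "user:" items, with a few other items mixed in.
queues = {"todo": [f"user:{i}" for i in range(1000)] + ["blog:a", "blog:b"]}
print(claim_items(queues, 5, re.compile(r"^user:"), random.Random()))
```

Because the filter runs after the random draw, the few claimable "blog:" items are rarely among the candidates, so most requests in this model come back empty even though the queue is not.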

None of this is accessible except by arkiver, JAA, and rewby, who need to be asked for any operations involved in enabling reclaiming, setting the claims limit, moving things between queues, etc.

Stats

At the bottom of the tracker page are various statistics. These statistics can be helpful to know how the project is doing. The most-used ones are:

  • Item request serve rate (IRSR): Percentage of item requests that are successful (i.e. given an item).
  • Reclaim rate: Percentage of items that have been reclaimed at least once.
  • Reclaim serve rate: Percentage of successful item requests that are given a reclaim.
  • Round-trip time (RTT): Average time between an item being claimed and marked as complete. Note that because multiple items are often given to workers at once, this is not necessarily how long each item took individually.
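The definitions above boil down to simple ratios. The sketch below computes three of them from made-up counters; the counter names are hypothetical and do not correspond to anything the tracker exposes.

```python
from statistics import mean

def percent(part, whole):
    """Percentage of `part` in `whole`, or 0.0 when there is nothing to divide by."""
    return 100.0 * part / whole if whole else 0.0

def summarize(requests, served, served_reclaims, round_trip_seconds):
    return {
        # IRSR: share of item requests that were given an item.
        "irsr": percent(served, requests),
        # Reclaim serve rate: share of successful requests that got a reclaim.
        "reclaim_serve_rate": percent(served_reclaims, served),
        # RTT: average time from claim to completion, in seconds.
        "rtt": mean(round_trip_seconds) if round_trip_seconds else None,
    }

print(summarize(requests=200, served=150, served_reclaims=30, round_trip_seconds=[10, 20, 30]))
```

A low IRSR together with a high reclaim serve rate would suggest the todo queues are empty or mostly rate-limited and workers are mainly being handed old claims.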

The statistics view relies on the websocket, so if there is no activity on a project or the websocket server is stuck (which does happen from time to time), the stats can't be viewed.

API

This is a sample project snippet from the projects.json file (line breaks included for readability):

{
    "name": "streetfiles",
    "title": "Streetfiles",
    "description": "Streetfiles is closing April, 30th, 2013.",
    "repository": "https://github.com/ArchiveTeam/streetfiles-grab.git",
    "logo": "http://archiveteam.org/images/7/7b/Streetfiles-logo.png",
    "marker_html": 
        "<a href='http://tracker.archiveteam.org/streetfiles/'>
        <img src='http://archiveteam.org/images/7/7b/Streetfiles-logo.png'
        alt='Streetfiles' width='235' height='50' /></a>",
    "deadline": "2013-04-30T23:59:59Z",
    "host": "streetfiles.org",
    "leaderboard": "http://tracker.archiveteam.org/streetfiles/",
    "lat_lng": [
        51,
        9
    ]
},

It shows where to get the grab code and other project information.

Here is an example root of the file:

{
    "auto_project": "projectslug",
    "broadcast_message": "<p>This message is shown only 
        in the warrior VM web UI at time of writing.</p>",
    "tracker_banner_html": "This is shown on the tracker 
        front page. <em>Wow!</em>",
    "warrior": {"seesaw_version":"0.7.0"},
    "projects": []
}
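A client could read this structure roughly as follows. This is a sketch under assumptions: the document is embedded as a string rather than fetched, it is trimmed to a few of the fields shown above, and treating auto_project as the name of an entry in projects is a guess based on the "projectslug" placeholder.

```python
import json

# Trimmed-down example document, built from the fields shown above.
SAMPLE = """
{
    "auto_project": "streetfiles",
    "warrior": {"seesaw_version": "0.7.0"},
    "projects": [
        {
            "name": "streetfiles",
            "title": "Streetfiles",
            "repository": "https://github.com/ArchiveTeam/streetfiles-grab.git",
            "leaderboard": "http://tracker.archiveteam.org/streetfiles/"
        }
    ]
}
"""

def find_project(doc, name):
    """Return the project entry with the given name, or None if it is not listed."""
    for project in doc.get("projects", []):
        if project.get("name") == name:
            return project
    return None

doc = json.loads(SAMPLE)
auto = find_project(doc, doc["auto_project"])
print(auto["title"], auto["repository"])
```

Note that json.loads only accepts strict JSON, so the line breaks inside string values shown in the samples above would have to be removed before parsing.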

Hardware

In 20??-present day (2023), the tracker runs on "several beefy Hetzner servers".

Software

  • A proprietary system consisting of a large wrapper around the original Universal Tracker, a Ruby HTTP application that sends and receives JSON payloads and uses Redis for the data store.
  • Redis, an in-memory key-value store.
  • Debian, the Linux distribution the stack is built upon.
  • warrior-hq, a small Sinatra web app that manages the Warriors and displays the geo-location world map.

You can also set up your own tracker.

History

This history is both incomplete (it stops in early 2012) and probably wrong in areas.

Originally, ArchiveTeam coordinated large projects through the wiki, keeping tables of what the tracker now calls items and letting people claim them and update their progress by editing the wiki.[1] Midway through the Google Video project, in April 2011, Underscor created a system called "Listerine" that did this automatically[2][3] (a client for the Listerine protocol can be found here). This remained an attractive concept (and the name "tracker" seems to have originated during a wistful discussion about this[4]), and later in the year Alard, the de facto project lead/resident enthusiast of MobileMe, wanted Underscor to set up another Listerine instance for it.[5] Apparently this never happened, since a few weeks later Alard wrote his own system, which had something resembling the current tracker protocol.[6] A few months later, this was replaced by a shell script called "Seesaw", which added automatic uploading with Rsync.[7]

Sometime in the late 2010s the open-source tracker was gradually replaced with the proprietary one. Then, or in the early 2020s, backfeed and multi-item support were added. As of 2023 most of the admin functionality is broken, as far as I know; everything but setting the minimum version and the rate limit is done with non-public methods by the tracker admins.