Dev/Targets


Note: This page is about targets for DPoS/Warrior projects. Other targets, such as those used for ArchiveBot or for offloading during IA outages, are not covered here. Their pages are yet to be written.

If you have any questions, you can ping rewby in #archiveteam-dev on Hackint.

Terminology

WARC: A file submitted by a worker, usually a single .warc.gz or .warc.zst containing just the data for that worker’s ‘item’. These files are typically fairly small, usually under 3 MB.

MegaWARC: The IA doesn’t handle small files well, so we combine the WARCs into a MegaWARC of typically 10 GB or more.

Pack: A set of files consisting of a MegaWARC, a metadata json file, and a .tar file containing rejected/corrupted data.

Chunk: A directory full of WARCs that will be “packed” into a pack.

Target (pipeline): A single instance of the target pipeline that handles a single project’s uploads.

Target (server): A server hosting one or more target pipelines.

Components of a target pipeline

Each project’s pipeline consists of several docker containers and a “data” folder. The data folder has a number of subfolders:

  • incoming
  • chunker-work
  • packing-queue
  • packer-work-in
  • packer-work-out
  • upload-queue
  • uploader-work

The data folder is bind-mounted into each container. Please don’t use Docker volumes; they’re prone to accidental deletion.
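
For illustration, a pipeline container might be started with a plain bind mount along these lines (the image name and paths are made-up examples, not the real ones):

  # Bind-mount the project's data folder into the container instead of
  # using a named Docker volume.
  docker run -d \
    --name example-project-chunker \
    -v /data/example-project:/data \
    example/chunker-image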

Airsync

An instance of rsync that receives data from workers and writes it to the “incoming” folder. Files are first written as temp files and only renamed to their final names once the transfer completes.

It has a mechanism that keeps track of the amount of space “free” in the “incoming” directory (via df). Two thresholds are specified: a soft limit and a hard limit.

When the soft limit is reached, the “max connections” setting of rsync is set to -1 to prevent it from accepting new connections while letting existing transfers complete. This provides a form of backpressure when things are going too fast and we start running out of disk space. (See the space considerations below for why this matters.) If we get too low on disk space, we can no longer pack new packs and everything locks up.

When the hard limit is reached, the rsync daemon is killed completely. This ensures there is sufficient space available for the packers to complete what they’re working on.
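
As a rough illustration of the threshold logic (this is not the real airsync code; the paths, thresholds, and config-editing approach are assumptions):

  #!/usr/bin/env bash
  # Sketch of the soft/hard limit behaviour. Paths and thresholds are
  # illustrative only.
  INCOMING=/data/example-project/incoming
  SOFT_LIMIT_KIB=$((200 * 1024 * 1024))   # refuse new connections below 200 GiB free
  HARD_LIMIT_KIB=$((50 * 1024 * 1024))    # kill the daemon below 50 GiB free

  while sleep 10; do
      free_kib=$(df -k --output=avail "$INCOMING" | tail -n 1)
      if [ "$free_kib" -lt "$HARD_LIMIT_KIB" ]; then
          # Hard limit: stop rsync entirely so the packers can finish their work.
          pkill -x rsync || true
      elif [ "$free_kib" -lt "$SOFT_LIMIT_KIB" ]; then
          # Soft limit: a negative "max connections" makes rsyncd refuse new
          # connections while existing transfers keep running; rsyncd re-reads
          # its config for each new connection, so no restart is needed.
          sed -i 's/^max connections = .*/max connections = -1/' /etc/rsyncd.conf
      else
          sed -i 's/^max connections = .*/max connections = 50/' /etc/rsyncd.conf
      fi
  done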

Chunker

A program that watches the contents of the “incoming” folder and moves completed uploads to “chunker-work/current/”. When the size of “chunker-work/current/” is greater than the desired MegaWARC size, the “current” folder is moved/renamed to “packing-queue/timestamp_randomhex/”.

Two versions of this program exist: one in megawarc-factory that’s written in bash, and a newer one written in Rust (which is a lot faster).
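
A minimal bash sketch of the chunking logic (this is neither of the real implementations; the data-folder path and the 10 GB threshold are illustrative):

  #!/usr/bin/env bash
  # Sketch of the chunker loop, run from the pipeline's data folder.
  cd /data/example-project
  CHUNK_SIZE=$((10 * 1024 * 1024 * 1024))   # desired MegaWARC size in bytes
  mkdir -p chunker-work/current packing-queue

  while sleep 5; do
      # Anything with a final name in "incoming" is a completed upload
      # (airsync only renames files once the transfer has finished).
      for f in incoming/*.warc.gz incoming/*.warc.zst; do
          if [ -e "$f" ]; then mv "$f" chunker-work/current/; fi
      done

      # Once the current chunk is big enough, hand the whole directory to
      # the packing queue with a single atomic rename.
      size=$(du -sb chunker-work/current | cut -f1)
      if [ "$size" -ge "$CHUNK_SIZE" ]; then
          mv chunker-work/current "packing-queue/$(date +%s)_$(openssl rand -hex 4)"
          mkdir chunker-work/current
      fi
  done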

Packer

A program that watches the “packing-queue/” directory for new folders moved there by the chunker. When one is found, it is atomically moved into the “packer-work-in” folder. (Not just the contents: the whole directory is moved under “packer-work-in”.) A corresponding empty folder is created in “packer-work-out”. The packer then takes the files in the “packer-work-in/” folder and makes a MegaWARC in the folder created in “packer-work-out”. When the MegaWARC is complete, its folder is moved to “upload-queue”.
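
A bash sketch of one packer pass (the real packer uses the megawarc tool; the concatenation step below is only a stand-in for it, and the paths are illustrative):

  #!/usr/bin/env bash
  # Sketch of a single packer pass, run from the pipeline's data folder.
  cd /data/example-project

  for chunk in packing-queue/*/; do
      name=$(basename "$chunk")

      # Claim the whole chunk directory with one atomic rename; if another
      # packer instance grabbed it first, just move on.
      mv "$chunk" "packer-work-in/$name" 2>/dev/null || continue
      mkdir -p "packer-work-out/$name"

      # Build the pack. The real pipeline runs megawarc here, which also
      # produces the metadata json and the tar of rejected/corrupted data.
      cat "packer-work-in/$name"/*.warc.gz > "packer-work-out/$name/${name}.megawarc.warc.gz"

      # Hand the finished pack to the uploaders, again with a single rename,
      # and drop the original small files.
      mv "packer-work-out/$name" "upload-queue/$name"
      rm -rf "packer-work-in/$name"
  done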

Uploader

A program that watches the “upload-queue” directory. When a new sub-directory is found, it moves that directory to “uploader-work”, uploads the data to the IA, and then deletes it from disk.
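
A bash sketch of one uploader pass (the item identifier and the use of the ia command-line tool here are illustrative, not the project’s actual upload mechanism):

  #!/usr/bin/env bash
  # Sketch of a single uploader pass, run from the pipeline's data folder.
  cd /data/example-project

  for pack in upload-queue/*/; do
      name=$(basename "$pack")

      # Claim the pack with an atomic rename so other uploader instances skip it.
      mv "$pack" "uploader-work/$name" 2>/dev/null || continue

      # Upload to the IA (here via the "ia" CLI from the internetarchive
      # package); only delete the local copy if the upload succeeded.
      if ia upload "archiveteam_example_${name}" "uploader-work/$name"/*; then
          rm -rf "uploader-work/$name"
      fi
  done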

Space considerations

If we take “n” to be the size of the MegaWARC, you need the following amount of space:

  • rsync: 2-3 x n
  • chunker: 1 x n
  • m x packer: m x (2 x n) (one copy for “packer-work-in” and one for “packer-work-out”)
  • o x uploader: o x n
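
As a purely illustrative example: with n = 10 GB, one rsync instance (worst case 3 x n), one chunker, four packers (m = 4) and four uploaders (o = 4), that works out to roughly 3n + n + 8n + 4n = 16n = 160 GB of working space, before any extra safety margin.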

Performance considerations

Packers are single-threaded and quite CPU-limited. As such, you will likely need multiple instances of the packer for each target pipeline that you’re running. There is no scientific way to determine how many you need; 4 is usually a good starting point.

Uploaders only do one pack at a time, and a single TCP connection is often quite throughput-limited. As such, you will want to run multiple of these as well. Experience shows you can get ~300 Mbps per instance. Generally, starting with 4 is a good idea.

Everything in the pipeline up to (and including) “packer-work-in” deals in Lots of Small Files. As such, you will want to run this on a fast SSD. Additionally, this is a very write-intensive workload and is known to destroy con-/prosumer SSDs in single-digit months. Using enterprise or even Optane SSDs is recommended. Under no circumstances do you want to run this on an HDD.

Since we generally expect targets to be doing upwards of 1 Gbps these days, we run the main permanent targets on smaller (~0.5-1 TB) SSDs and just ensure we have enough space (see the previous section) to pack and upload at the same (or greater) throughput as we ingest.

In some cases, however, it is desirable to hold data on the target as a buffer before uploading it. In such cases it may be useful to have the “packer-work-out”, “upload-queue”, and “uploader-work” directories on a pool of HDDs to hold the finished packs before they get uploaded. This is quite a situational thing and for “normal operations” the previous setup is preferred.

IMPORTANT: When you do a setup like this, one specific property must be maintained: mv operations between the above directories need to be “atomic” (i.e. handled by a single rename syscall). In practice, this primarily means that all of these directories must live on the same filesystem.
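
To illustrate the difference (paths are examples only):

  # Atomic: source and destination are on the same filesystem, so mv is a
  # single rename() and a watcher sees either the old path or the new one.
  mv packer-work-out/pack1 upload-queue/pack1

  # NOT atomic: crossing a filesystem boundary makes mv fall back to
  # copy-then-delete, so a watcher can pick up a half-copied directory.
  mv packer-work-out/pack1 /mnt/other-filesystem/upload-queue/pack1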

Operational considerations

As most permanent targets are running multiple pipelines, there are a couple of things you must keep in mind. Notably, the disk space is shared between all pipelines. There is a failure mode where one project gets Really Busy and takes up all the space until all projects hit their soft or hard rsync limits.

On some newer target nodes we have started using ZFS to host the data folders. This allows a quota to be set on each project’s data directory. The quotas ensure that when a project goes Too Fast, it only backpressures itself. This prevents some projects (urls, telegram, youtube) from stalling the rest.
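
For example, a per-project quota might be set like this (pool and dataset names are made up):

  # One dataset per pipeline, each with its own quota.
  zfs create tank/targets/example-project
  zfs set quota=500G tank/targets/example-project
  zfs get quota tank/targets/example-project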

File integrity

When rsync completes a transfer, the file is considered “done” by the tracker. If we lose a file on the target, recovering from that gets really fucky or even impossible.