ArchiveBot/Monitoring

From Archiveteam

ArchiveBot communicates with its dashboard over a WebSocket. The same stream can therefore be consumed by other tools to monitor ArchiveBot for other purposes.

Since the primary ArchiveBot WebSocket server can easily get overloaded, clients must connect only to the primary instance of the archivebot-dashboard-repeater project:

ws://archivebot.archivingyoursh.it/stream

The WebSocket stream usually carries around 500 KiB/s of JSON, which is over 40 GiB per day or over 1 TiB per month. Clients of the repeater must keep up with this volume or they will be disconnected, so monitoring from internet connections with small bandwidth quotas is not recommended. When several processes monitor the data, run the archivebot-dashboard-repeater project locally so the data is downloaded only once. It has a Docker setup, or it can be run from a terminal:

cd ws-repeater/
export UPSTREAM=ws://archivebot.archivingyoursh.it/stream
gunicorn app:app -b 127.0.0.1:4568 --worker-class uvicorn.workers.UvicornWorker
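The daily and monthly volumes quoted above follow from the per-second rate. A short check of the arithmetic, assuming a steady 500 KiB/s and a 30-day month (both simplifications, since the real rate varies):

```python
# Rough volume estimate for the ArchiveBot WebSocket stream.
# Assumptions: a constant 500 KiB/s and a 30-day month.
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4

rate_bytes_per_second = 500 * KIB
bytes_per_day = rate_bytes_per_second * 24 * 60 * 60
bytes_per_month = bytes_per_day * 30

gib_per_day = bytes_per_day / GIB
tib_per_month = bytes_per_month / TIB

print(f"{gib_per_day:.1f} GiB/day, {tib_per_month:.2f} TiB/month")
# prints: 41.2 GiB/day, 1.21 TiB/month
```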

Clients

Any WebSocket client will work. These have been used before:

  • Use a command-line WebSocket client and pipe the output to a JSON processor like jq. This can be very flexible but a bit hacky.
    • curl (requires 8.11.0 or later)
    • websocat
    • ulfius uwsc -q ws://archivebot.archivingyoursh.it/stream | tr -d '\10'
  • Use a WebSocket library for your favorite programming language.
  • archivebot-dashboard-repeater - written in Python using the websockets module
  • gs-firehose - written in Rust, but it no longer builds.
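As an illustration of the library approach, here is a minimal sketch using the Python websockets module. The helper names extract_url and print_urls are hypothetical, and the assumption that each message is a JSON object with a top-level "url" field has not been checked against the actual stream, so inspect a few real messages first.

```python
import json

STREAM_URL = "ws://archivebot.archivingyoursh.it/stream"


def extract_url(raw_message):
    """Return the "url" field of one JSON message, or None if absent.

    The "url" key is an assumption about the message layout.
    """
    try:
        message = json.loads(raw_message)
    except (TypeError, ValueError):
        return None  # not valid JSON
    if not isinstance(message, dict):
        return None
    url = message.get("url")
    return url if isinstance(url, str) else None


async def print_urls(stream_url=STREAM_URL):
    """Connect to the repeater and print every URL seen in the stream."""
    # Third-party dependency (pip install websockets), imported here so
    # that extract_url can be used without it.
    import websockets

    async with websockets.connect(stream_url) as ws:
        async for raw_message in ws:
            url = extract_url(raw_message)
            if url is not None:
                print(url)
```

To start it, call asyncio.run(print_urls()) after importing asyncio. The loop does no slow work per message, which matters because clients that fall behind are disconnected.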

Current monitoring

Several people are continuously monitoring the ArchiveBot data for various reasons:

The #archivebot-alerts IRC channel has a bot that monitors for anomalous situations to help prevent URL loops, server overloading and other bad situations.

pabs is monitoring for interesting URLs of several types. This monitoring is currently based on the curl | jq method above, with a set of match regexes and a set of ignore regexes.
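The match/ignore idea can be sketched in Python as follows. This is not the setup pabs actually runs: the patterns and the is_interesting helper are hypothetical placeholders, chosen only to show how an ignore list takes precedence over a match list.

```python
import re

# Hypothetical patterns, for illustration only.
MATCH_PATTERNS = [
    re.compile(r"\.swf(?:$|[?#])", re.IGNORECASE),
    re.compile(r"\.jar(?:$|[?#])", re.IGNORECASE),
]
IGNORE_PATTERNS = [
    re.compile(r"^https?://[^/]*\.example\.com/"),
]


def is_interesting(url, match_patterns=MATCH_PATTERNS,
                   ignore_patterns=IGNORE_PATTERNS):
    """Return True if url hits a match regex and no ignore regex."""
    if any(p.search(url) for p in ignore_patterns):
        return False
    return any(p.search(url) for p in match_patterns)
```

Checking the ignore list first keeps known-noisy hosts out of the results even when their URLs would otherwise match.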

User:Ryz is monitoring (run by pabs) for interesting URLs of several types:

  • Flash files (swf)
  • Shockwave (dcr dir dxr cct cst cxt drx)
  • Executables and related archives (exe zip 7zip 7z rar dmg ipa deb xapk apk)
  • Java related files (java jar jnlp class)
  • Unity Web Player Archive files (.unity3d)
  • Document files of various kinds (txt doc docx docm xls pptx ppt fodt fods fodp fodg odt ods odp odg odf sxc sxd sxg sxi sxm sxw stw stc std sti xps oxps pdb prc uof uot uos uop wpd wp wp7 wp6 wp5 wp4 slk eps ps pdf rtf xml eml log msg pages djvu djv dbk dockbook fb2 fb2.zip fbz fb3 tex epub)
  • Video files of various kinds (3g2 3gp amv asf avi drc f4a f4b f4p f4v flv gif gifv M2TS m2v m4p m4v mkv mng mov mp2 mp4 mpe mpeg mpg mpv MTS mxf nsv ogg ogv qt rm rmvb roq svi TS viv vob webm wmv yuv)
  • Audio files of various kinds (3gp 8svx aa aac aax act aiff alac amr ape au awb cda dss dvf flac gsm iklax ivs m4a m4b m4p mmf mogg movpkg mp3 mpc msv nmf oga ogg opus ra raw rf64 rm sln tta voc vox wav webm wma wv)
  • IPv4 addresses
  • Open directories
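A rough sketch of how URLs might be sorted into a few of the categories above, using only the Python standard library. The classify helper and the CATEGORIES table are hypothetical, only a handful of the listed extensions are included, and reading "IPv4 addresses" as "URLs whose host is an IPv4 literal" is an assumption.

```python
import ipaddress
import posixpath
from urllib.parse import urlsplit

# A small subset of the extension lists above, for illustration.
CATEGORIES = {
    "flash": {"swf"},
    "shockwave": {"dcr", "dir", "dxr", "cct", "cst", "cxt", "drx"},
    "java": {"java", "jar", "jnlp", "class"},
    "unity": {"unity3d"},
}


def classify(url):
    """Return the list of category labels that apply to url."""
    parts = urlsplit(url)
    labels = []

    # Compare the extension of the URL path, ignoring query and fragment.
    ext = posixpath.splitext(parts.path)[1].lstrip(".").lower()
    for name, extensions in CATEGORIES.items():
        if ext in extensions:
            labels.append(name)

    # Flag URLs whose host is an IPv4 literal rather than a domain name.
    try:
        ipaddress.IPv4Address(parts.hostname or "")
        labels.append("ipv4")
    except ValueError:
        pass

    return labels
```

For example, classify("http://192.0.2.7/files/game.SWF") returns ["flash", "ipv4"], while a plain HTML page on a named host returns an empty list.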

Ideas