ArchiveBot/Monitoring

ArchiveBot communicates with its dashboard over a WebSocket, which means the same data stream can be monitored for other purposes and with other tools.

Since the primary ArchiveBot WebSocket server can easily become overloaded, clients must connect only to the primary instance of the archivebot-dashboard-repeater project:

ws://archivebot.archivingyoursh.it/stream

The WebSocket data is usually around 500 KiB/s, which works out to over 40 GiB/day or over 1 TiB of JSON per month. Clients of the repeater must keep up with this volume or they will be disconnected, so monitoring from internet connections with small bandwidth quotas is not recommended. When monitoring the data with multiple processes, it is recommended to run the archivebot-dashboard-repeater project locally so the data is only downloaded once; it has a Docker setup or can be run from a terminal:

cd ws-repeater/
export UPSTREAM=ws://archivebot.archivingyoursh.it/stream
# either run the app directly with uvicorn:
uvicorn app:app --host localhost --port 4568
# or under gunicorn with the uvicorn worker:
gunicorn app:app -b 127.0.0.1:4568 --worker-class uvicorn.workers.UvicornWorker
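
With the repeater running locally, monitoring processes can then read from the local instance so the upstream stream is only downloaded once. A minimal sketch, assuming the local instance exposes the stream at the same /stream path:

# local clients read from the repeater instead of the upstream server
websocat -t ws://localhost:4568/stream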

Clients

Any WebSocket client will work; these have been used before:

  • Use a command-line WebSocket client and pipe the output to a JSON processor like jq. This can be very flexible, if a bit hacky; see the sketch after this list.
    • curl (requires 8.11.0 or later)
    • websocat
    • ulfius: uwsc -q ws://archivebot.archivingyoursh.it/stream | tr -d '\10' (the tr strips backspace characters from the output)
  • Use a WebSocket library for your favorite programming language.
  • archivebot-dashboard-repeater - written in Python using the websockets module
  • ab2f - written in Python using the websockets module
  • gs-firehose - written in Rust - doesn't build any more.
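
A minimal sketch of the first approach (a command-line client piped to jq); the .url field name here is an assumption about the stream's JSON structure:

# dump the firehose and extract one field from each JSON message
websocat -t ws://archivebot.archivingyoursh.it/stream | jq -r --unbuffered '.url? // empty'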

Current monitoring

Several people are continuously monitoring the ArchiveBot data for various reasons:

The #archivebot-alerts IRC channel has a bot called nullbot that watches for anomalies in order to help prevent URL loops, server overloading, and other bad situations.

The ab2f service records the WebSocket JSONL data to downloadable files. This can be useful when you need to look further back in the WebSocket history than ArchiveBot itself retains.

pabs is monitoring for interesting URLs of several types. These are currently based on the curl | jq method above with a set of match regexes and ignore regexes.
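
A hedged sketch of such a pipeline, assuming one JSON object per line with a url field; match.regexes and ignore.regexes are illustrative file names with one extended regex per line:

# curl 8.11.0+ can connect to the WebSocket directly; keep URLs matching the
# match list and drop those matching the ignore list
curl -s --no-buffer ws://archivebot.archivingyoursh.it/stream \
  | jq -r --unbuffered '.url? // empty' \
  | grep -E -f match.regexes \
  | grep -vE -f ignore.regexes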

User:Ryz is monitoring (run by pabs) for interesting URLs of several types (see the sketch after this list):

  • Flash files (swf)
  • Shockwave (dcr dir dxr cct cst cxt drx)
  • Executables and related archives (exe zip 7zip 7z rar dmg ipa deb xapk apk)
  • Java related files (java jar jnlp class)
  • Unity Web Player Archive files (.unity3d)
  • Document files of various kinds (txt doc docx docm xls pptx ppt fodt fods fodp fodg odt ods odp odg odf sxc sxd sxg sxi sxm sxw stw stc std sti xps oxps pdb prc uof uot uos uop wpd wp wp7 wp6 wp5 wp4 slk eps ps pdf rtf xml eml log msg pages djvu djv dbk docbook fb2 fb2.zip fbz fb3 tex epub)
  • Video files of various kinds (3g2 3gp amv asf avi drc f4a f4b f4p f4v flv gif gifv m3u8 M2TS m2v m4p m4v mkv mng mov mp2 mp4 mpe mpeg mpg mpv MTS mxf nsv ogg ogv qt rm rmvb roq svi TS viv vob webm wmv yuv)
  • Audio files of various kinds (3gp 8svx aa aac aax act aiff alac amr ape au awb cda dss dvf flac gsm iklax ivs m4a m4b m4p mmf mogg movpkg mp3 mpc msv nmf oga ogg opus ra raw rf64 rm sln tta voc vox wav webm wma wv)
  • IPv4 addresses
  • Open directories
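
Each category above reduces to a pattern applied to the URL stream; for example, the Flash and Shockwave entries could be caught by appending a stage like this to a pipeline such as the one shown earlier:

# case-insensitive match on the extension, allowing a trailing query string or fragment
grep -iE '\.(swf|dcr|dir|dxr|cct|cst|cxt)([?#]|$)'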


c3manu is monitoring for interesting URLs of several types:

  • URLs from the Financial Times Origami Image Service that received HTTP response code 406 "Not Acceptable" for re-archiving at slower speeds
  • github.io subdomains (including up to one additional path segment, since the root domain itself isn't always used) for recursive archival through their personal backlog (this requires more processing to generate useful lists automatically)
  • https://publish.obsidian.md/ notebooks for later archival through their personal backlog (requires extraction to generate lists for '!ao <')

Ideas

  • TLS errors, for re-archiving the corresponding URLs with http instead of https
  • Anubis/go-away etc resources, for re-archiving the corresponding pages with other UAs
  • Software related files, like README, COPYING etc
  • free.fr https:// URLs time out, for re-saving as http:// URLs
  • MediaWiki action=edit URLs, for saving action=raw URLs, which are not found by AB
  • Etherpad instances, which are JavaScripty but have text/HTML/JSON exports
  • Paste hosting sites other than Pastebin, which often have HTML and raw versions of each page
  • Hyphanet
  • YouTube, for discovering unlisted videos and videos referenced by AB jobs. Detection via youtube.com, youtu.be, ytimg.com, and Invidious/other alternative frontend instances.
  • Vimeo for future archiving
  • Dailymotion for future archiving
  • puush dislikes the default User-Agent and returns 502s.
  • Discord cannot be saved with AB, requires special tools
  • Google Docs/Drive/Jamboard/My Maps (drive|docs)\.google\.com|jamboard\.com|google.com/maps/d/viewer
  • Google Imgres URLs, for re-archiving the target URLs
  • RSS, Atom, and JSON feeds
  • Facebook
  • LiveJournal
  • Twitter
  • Scribd
  • Slideshare
  • Godbolt, for rearchiving /noscript/ versions
  • Mastodon
  • LinkedIn
  • Links that generally end in .htm, .html, .aspx, .jsp, or .pl
  • Common Gateway Interface
  • Obsolete twimg.com subdomains, for re-archiving the pbs.twimg.com equivalent
  • Image resizing services, for re-archiving the original size
  • Image proxy services, for re-archiving the original URL
  • HTML cache services, for re-archiving the original URL
  • "External link" services, for re-archiving the original URL
  • Machine-translation services, for re-archiving the original URL
  • Yandex Disk links