ArchiveBot/Monitoring

ArchiveBot communicates with its dashboard over a WebSocket, which means the same data stream can be monitored for other purposes and with other tools.

Since the primary ArchiveBot WebSocket server can easily become overloaded, clients must connect only to the primary instance of the archivebot-dashboard-repeater project:

ws://archivebot.archivingyoursh.it/stream

The WebSocket data is usually around 500 KiB/s, which works out to over 40 GiB/day or over 1 TiB of JSON per month. Clients of the repeater must keep up with this volume or they will be disconnected, so monitoring from internet connections with small bandwidth quotas is not recommended. When monitoring the data with multiple processes, it is recommended to run the archivebot-dashboard-repeater project locally so the data is only downloaded once; it has a Docker setup or can be run from a terminal:

cd ws-repeater/
export UPSTREAM=ws://archivebot.archivingyoursh.it/stream
# either run the app directly with uvicorn:
uvicorn app:app --host localhost --port 4568
# or under gunicorn with the uvicorn worker:
gunicorn app:app -b 127.0.0.1:4568 --worker-class uvicorn.workers.UvicornWorker
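
With the repeater running locally, monitoring processes can then read from the local instance so the upstream stream is only downloaded once. A minimal sketch, assuming the local instance exposes the stream at the same /stream path:

# local clients read from the repeater instead of the upstream server
websocat -t ws://localhost:4568/stream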

Clients

Any WebSocket client will work; these have been used before:

  • Use a command-line WebSocket client and pipe the output to a JSON processor like jq. This can be very flexible, if a bit hacky; see the sketch after this list.
    • curl (requires 8.11.0 or later)
    • websocat
    • ulfius: uwsc -q ws://archivebot.archivingyoursh.it/stream | tr -d '\10' (the tr strips backspace characters from the output)
  • Use a WebSocket library for your favorite programming language.
  • archivebot-dashboard-repeater - written in Python using the websockets module
  • ab2f - written in Python using the websockets module
  • gs-firehose - written in Rust - doesn't build any more.
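
A minimal sketch of the first approach (a command-line client piped to jq); the .url field name here is an assumption about the stream's JSON structure:

# dump the firehose and extract one field from each JSON message
websocat -t ws://archivebot.archivingyoursh.it/stream | jq -r --unbuffered '.url? // empty'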

Current monitoring

Several people are continuously monitoring the ArchiveBot data for various reasons:

The #archivebot-alerts IRC channel has a bot called nullbot that watches for anomalies in order to help prevent URL loops, server overloading, and other bad situations.

The ab2f service records the WebSocket JSONL data to downloadable files. This can be useful when you need to look further back in the WebSocket history than ArchiveBot itself retains.

pabs is monitoring for interesting URLs of several types. These are currently based on the curl | jq method above with a set of match regexes and ignore regexes.
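
A hedged sketch of such a pipeline, assuming one JSON object per line with a url field; match.regexes and ignore.regexes are illustrative file names with one extended regex per line:

# curl 8.11.0+ can connect to the WebSocket directly; keep URLs matching the
# match list and drop those matching the ignore list
curl -s --no-buffer ws://archivebot.archivingyoursh.it/stream \
  | jq -r --unbuffered '.url? // empty' \
  | grep -E -f match.regexes \
  | grep -vE -f ignore.regexes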

User:Ryz is monitoring (run by pabs) for interesting URLs of several types (see the sketch after this list):

  • Flash files (swf)
  • Shockwave (dcr dir dxr cct cst cxt drx)
  • Executables and related archives (exe zip 7zip 7z rar dmg ipa deb xapk apk)
  • Java related files (java jar jnlp class)
  • Unity Web Player Archive files (.unity3d)
  • Document files of various kinds (txt doc docx docm xls pptx ppt fodt fods fodp fodg odt ods odp odg odf sxc sxd sxg sxi sxm sxw stw stc std sti xps oxps pdb prc uof uot uos uop wpd wp wp7 wp6 wp5 wp4 slk eps ps pdf rtf xml eml log msg pages djvu djv dbk docbook fb2 fb2.zip fbz fb3 tex epub)
  • Video files of various kinds (3g2 3gp amv asf avi drc f4a f4b f4p f4v flv gif gifv m3u8 M2TS m2v m4p m4v mkv mng mov mp2 mp4 mpe mpeg mpg mpv MTS mxf nsv ogg ogv qt rm rmvb roq svi TS viv vob webm wmv yuv)
  • Audio files of various kinds (3gp 8svx aa aac aax act aiff alac amr ape au awb cda dss dvf flac gsm iklax ivs m4a m4b m4p mmf mogg movpkg mp3 mpc msv nmf oga ogg opus ra raw rf64 rm sln tta voc vox wav webm wma wv)
  • IPv4 addresses
  • Open directories
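
Each category above reduces to a pattern applied to the URL stream; for example, the Flash and Shockwave entries could be caught by appending a stage like this to a pipeline such as the one shown earlier:

# case-insensitive match on the extension, allowing a trailing query string or fragment
grep -iE '\.(swf|dcr|dir|dxr|cct|cst|cxt)([?#]|$)'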


c3manu is monitoring for interesting URLs of several types:

  • URLs from the Financial Times Origami Image Service that received HTTP response code 406 "Not Acceptable" for re-archiving at slower speeds
  • github.io subdomains (including up to one additional path segment, since the root domain itself isn't always used) for recursive archival through their personal backlog (this requires more processing to generate useful lists automatically)
  • https://publish.obsidian.md/ notebooks for later archival through their personal backlog (requires extraction to generate lists for '!ao <')

Ideas

  • TLS errors, for re-archiving the corresponding URLs with http instead of https
  • Anubis/go-away etc resources, for re-archiving the corresponding pages with other UAs
  • Software related files, like README, COPYING etc
  • free.fr https:// URLs time out, for re-saving as http:// URLs
  • MediaWiki action=edit URLs, for saving action=raw URLs, which are not found by AB
  • Etherpad instances, which are JavaScripty but have text/HTML/JSON exports
  • Paste hosting sites other than Pastebin, which often have HTML and raw versions of each page
  • Hyphanet
  • YouTube, for discovering unlisted videos and videos referenced by AB jobs. Detection via youtube.com, youtu.be, ytimg.com, and Invidious/other alternative frontend instances.
  • Vimeo for future archiving
  • Dailymotion for future archiving
  • puush dislikes the default User-Agent and returns 502s.
  • Discord cannot be saved with AB, requires special tools
  • Google Docs/Drive/Jamboard/My Maps (drive|docs)\.google\.com|jamboard\.com|google.com/maps/d/viewer
  • Google Imgres URLs, for re-archiving the target URLs
  • RSS, Atom, and JSON feeds
  • Facebook
  • LiveJournal
  • Twitter
  • Scribd
  • Slideshare
  • Godbolt, for rearchiving /noscript/ versions
  • Mastodon
  • LinkedIn
  • Links that generally end in .htm, .html, .aspx, .jsp, or .pl
  • Common Gateway Interface
  • Obsolete twimg.com subdomains, for re-archiving the pbs.twimg.com equivalent
  • Image resizing services, for re-archiving the original size
  • Image proxy services, for re-archiving the original URL
  • HTML cache services, for re-archiving the original URL
  • "External link" services, for re-archiving the original URL
  • Machine-translation services, for re-archiving the original URL
  • Yandex Disk links