ArchiveBot/Monitoring
ArchiveBot communicates with its dashboard over a WebSocket, so the same data stream can also be monitored for other purposes and by other tools.
Because the primary ArchiveBot WebSocket server is easily overloaded, clients must connect only to the primary instance of the archivebot-dashboard-repeater project:
ws://archivebot.archivingyoursh.it/stream
The WebSocket stream usually carries around 500 KiB/s of JSON, which is over 40 GiB per day or over 1 TiB per month. Clients of the repeater must keep up with this volume or they will be disconnected, so monitoring from internet connections with small bandwidth quotas is not recommended. When monitoring the data with multiple processes, run the archivebot-dashboard-repeater project locally so that the data is downloaded only once. It has a Docker setup, or it can be run from a terminal:
cd ws-repeater/
export UPSTREAM=ws://archivebot.archivingyoursh.it/stream
gunicorn app:app -b 127.0.0.1:4568 --worker-class uvicorn.workers.UvicornWorker
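The volume figures quoted above follow from the per-second rate. A quick check, assuming a steady 500 KiB/s and a 30-day month (both are simplifications, since the real rate varies):

```python
# Rough volume estimate for the ArchiveBot WebSocket stream.
# Assumptions: a constant 500 KiB/s rate and a 30-day month.
RATE_KIB_PER_S = 500
SECONDS_PER_DAY = 24 * 60 * 60

kib_per_day = RATE_KIB_PER_S * SECONDS_PER_DAY
gib_per_day = kib_per_day / 1024**2      # KiB -> GiB
tib_per_month = gib_per_day * 30 / 1024  # GiB -> TiB

print(f"{gib_per_day:.1f} GiB/day, {tib_per_month:.2f} TiB/month")
# prints "41.2 GiB/day, 1.21 TiB/month"
```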
Clients
Any WebSocket client will work. These approaches have been used before:
- Use a command-line WebSocket client and pipe the output to a JSON processor like jq. This can be very flexible but a bit hacky.
- Use a WebSocket library for your favorite programming language.
- archivebot-dashboard-repeater - written in Python using the websockets module
- gs-firehose - written in Rust; it no longer builds.
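The library approach can be sketched in Python with the websockets module. This is a minimal sketch, not the code of any of the projects listed: it assumes that each WebSocket message is a JSON object and that download events carry a "url" field. The field name is an assumption and should be checked against the real stream. The helper names `extract_url` and `monitor` are made up for illustration.

```python
import asyncio
import json

# Stream URL as given above; point this at a local repeater if one is running.
STREAM_URL = "ws://archivebot.archivingyoursh.it/stream"


def extract_url(raw_message):
    """Return the "url" field of one JSON message, or None.

    Assumption: messages are JSON objects and some of them have a "url" key.
    """
    try:
        message = json.loads(raw_message)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(message, dict):
        return None
    return message.get("url")


async def monitor(stream_url=STREAM_URL):
    """Print every URL seen on the stream until the connection closes."""
    # Imported here so extract_url() works without the dependency installed.
    import websockets  # third-party: pip install websockets

    # max_size=None lifts the library's default per-message size limit.
    async with websockets.connect(stream_url, max_size=None) as ws:
        async for raw_message in ws:
            url = extract_url(raw_message)
            if url is not None:
                print(url)


# To start monitoring: asyncio.run(monitor())
```

Keeping the per-message work small matters here, because a client that falls behind the stream is disconnected.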
Current monitoring
Several people are continuously monitoring the ArchiveBot data for various reasons:
The #archivebot-alerts IRC channel has a bot that watches for anomalies to help prevent URL loops, server overload and other problems.
pabs is monitoring for interesting URLs of several types. These are currently based on the curl | jq method above, with a set of match regexes and ignore regexes:
- Code forges for Codearchiver and Software Heritage
- Wiki instances for WikiBot
- HTTP to SmolNet proxies
- Tor Onion services and HTTP proxies to Tor Onion services for Tor URLs
- I2P eepsites and HTTP proxies to I2P eepsites
- Flickr URLs that got HTTP 403 errors for re-archiving with a different User-Agent
- Blogspot/Blogger URLs for DPoS archiving
- Imgur URLs for DPoS archiving
- Pastebin URLs for DPoS archiving
- Mediafire URLs for DPoS archiving
- Telegram URLs for DPoS archiving
- Dropbox URLs that need the download URL archiving (not yet running)
- Webring related URLs
- Mailman/2 instances
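The match-and-ignore approach can be illustrated as follows. This is only a sketch of the general technique: the patterns below are hypothetical examples, not the regexes pabs actually uses, and `is_interesting` is a made-up helper name.

```python
import re

# Hypothetical patterns for illustration only.
MATCH_PATTERNS = [
    re.compile(r"^https?://(www\.)?pastebin\.com/"),
    re.compile(r"^https?://([a-z0-9-]+\.)?imgur\.com/"),
]
IGNORE_PATTERNS = [
    re.compile(r"\.(css|js)(\?|$)"),  # skip static assets
]


def is_interesting(url):
    """True if url matches any match regex and no ignore regex."""
    if any(p.search(url) for p in IGNORE_PATTERNS):
        return False
    return any(p.search(url) for p in MATCH_PATTERNS)


for url in [
    "https://pastebin.com/abc123",
    "https://pastebin.com/static/site.css",
    "https://example.com/",
]:
    print(url, is_interesting(url))
```

Checking the ignore list first means a noisy URL is rejected without evaluating the match patterns at all.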
User:Ryz is monitoring (run by pabs) for interesting URLs of several types:
- Flash files (swf)
- Shockwave (dcr dir dxr cct cst cxt drx)
- Executables and related archives (exe zip 7zip 7z rar dmg ipa deb xapk apk)
- Java related files (java jar jnlp class)
- Unity Web Player Archive files (.unity3d)
- Document files of various kinds (txt doc docx docm xls pptx ppt fodt fods fodp fodg odt ods odp odg odf sxc sxd sxg sxi sxm sxw stw stc std sti xps oxps pdb prc uof uot uos uop wpd wp wp7 wp6 wp5 wp4 slk eps ps pdf rtf xml eml log msg pages djvu djv dbk dockbook fb2 fb2.zip fbz fb3 tex epub)
- Video files of various kinds (3g2 3gp amv asf avi drc f4a f4b f4p f4v flv gif gifv M2TS m2v m4p m4v mkv mng mov mp2 mp4 mpe mpeg mpg mpv MTS mxf nsv ogg ogv qt rm rmvb roq svi TS viv vob webm wmv yuv)
- Audio files of various kinds (3gp 8svx aa aac aax act aiff alac amr ape au awb cda dss dvf flac gsm iklax ivs m4a m4b m4p mmf mogg movpkg mp3 mpc msv nmf oga ogg opus ra raw rf64 rm sln tta voc vox wav webm wma wv)
- IPv4 addresses
- Open directories
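Extension-based detection like this can be sketched by looking at the path component of each URL. How the actual monitor matches is not described here, so this is an assumption-laden illustration: the category names and the `classify` helper are hypothetical, and only a few of the extension lists above are included.

```python
import posixpath
from urllib.parse import urlsplit

# A small subset of the extension lists above; category names are hypothetical.
EXTENSIONS = {
    "flash": {"swf"},
    "shockwave": {"dcr", "dir", "dxr", "cct", "cst", "cxt", "drx"},
    "java": {"java", "jar", "jnlp", "class"},
    "unity": {"unity3d"},
}


def classify(url):
    """Return the category for the URL's file extension, or None."""
    # Use only the path so query strings and fragments are ignored.
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    for category, extensions in EXTENSIONS.items():
        if ext in extensions:
            return category
    return None
```

For example, `classify("http://example.com/games/intro.SWF?v=2")` returns `"flash"`, because the comparison is case-insensitive and the query string is dropped before the extension is read.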
Ideas
- Paste hosting sites other than Pastebin
- Hyphanet
- YouTube for discovering unlisted videos and videos referenced by AB jobs. Detection via youtube.com, youtu.be, ytimg.com, and Invidious or other alternative frontend instances.
- Vimeo for future archiving
- Dailymotion for future archiving
- puush dislikes the default User-Agent and returns 502s.
- Discord cannot be saved with AB and requires special tools
- Google Docs/Drive/Jamboard/My Maps
(drive|docs)\.google\.com|jamboard\.com|google\.com/maps/d/viewer
- Feeds from RSS, Atom, and JSON
- Mastodon
- Links that end in .htm, .html, .aspx, .jsp, or .pl
- Common Gateway Interface