Chromebot

From Archiveteam
Revision as of 21:11, 16 April 2021 by Iki (talk | contribs) (+info on Wayback Machine ingestion and matching URLs to items)

chromebot, a.k.a. crocoite, is an IRC bot that runs parallel to ArchiveBot. It uses Google Chrome and is therefore able to archive JavaScript-heavy and bottomless (infinitely scrolling) websites. WARCs are uploaded twice a day to the chromebot collection on archive.org and are later ingested into the Wayback Machine. For a given item in the collection, you can see which URLs are saved in the WARC by looking at the associated jobs.json.gz file.
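As a rough illustration of inspecting a jobs.json.gz file, the following Python sketch lists every URL-like string it can find in one. The layout of the file is not described here, so the sketch makes no assumption about field names: it accepts either a single JSON document or one JSON document per line (both are guesses) and simply collects all string values that start with http:// or https://. The function names are hypothetical.

```python
import gzip
import json


def iter_urls(node):
    """Yield every string value in a parsed JSON structure that looks like an http(s) URL."""
    if isinstance(node, str):
        if node.startswith(("http://", "https://")):
            yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_urls(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_urls(value)


def urls_in_jobs_file(path):
    """Return the sorted, de-duplicated URLs found in a gzip-compressed JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        text = handle.read()
    try:
        # Assumption 1: the whole file is a single JSON document.
        documents = [json.loads(text)]
    except json.JSONDecodeError:
        # Assumption 2: the file holds one JSON document per line.
        documents = [json.loads(line) for line in text.splitlines() if line.strip()]
    found = set()
    for document in documents:
        found.update(iter_urls(document))
    return sorted(found)
```

Calling urls_in_jobs_file("jobs.json.gz") on a downloaded file would then return a list of URL strings, which can be searched for the page of interest.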

By default, the bot only grabs a single URL. It also supports recursion, but this is rather slow because every single page needs to be loaded and rendered by a browser. A dashboard is available for watching the progress of such jobs.

Usage

crocoite usage documentation on GitHub

You can call chromebot in the #archivebot IRC channel on hackint, which it shares with ArchiveBot. Commands may be prefixed with either “chromebot” or “chromebot:”; the colon is optional.

Command                                            Description
chromebot: a <url> -r <policy> -j <concurrency>    Archive <url> with <concurrency> processes according to recursion <policy>.
chromebot: s <uuid>                                Get job status for <uuid>.
chromebot: r <uuid>                                Revoke or abort running job with <uuid>.

Please note that the commands are case-sensitive.
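The command syntax above can be summarised with a small Python sketch that assembles the IRC messages as strings. The helper names are hypothetical, and treating -r and -j as independently optional is an assumption; only the full form with both flags is documented on this page.

```python
def archive_command(url, policy=None, concurrency=None):
    """Build an 'a' (archive) command; the flags are appended only when given."""
    parts = ["chromebot:", "a", url]
    if policy is not None:
        parts += ["-r", str(policy)]       # recursion policy
    if concurrency is not None:
        parts += ["-j", str(concurrency)]  # number of parallel processes
    return " ".join(parts)


def status_command(uuid):
    """Build an 's' (status) command for the job with the given UUID."""
    return f"chromebot: s {uuid}"


def revoke_command(uuid):
    """Build an 'r' (revoke/abort) command for the job with the given UUID."""
    return f"chromebot: r {uuid}"
```

The command letters are kept in lower case throughout because the commands are case-sensitive.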

URL lists can be archived using recursion, for example:

chromebot: a https://transfer.notkiska.pw/inline/UpfR/HollyConrad-tweets -r 1 -j 4

chromebot will assume all lines starting with http(s):// are valid links. Note that the list itself must be returned by the server as an *inline* document, not as a download (attachment).
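Before submitting a list, it can be useful to check it locally. The Python sketch below mimics the rule described above by keeping only lines that start with http:// or https://, and adds a helper that inspects a Content-Disposition header value to see whether a server would deliver the list inline. Both functions are hypothetical illustrations and not chromebot's own code; treating a missing header as inline reflects common browser behaviour and is an assumption here.

```python
def extract_links(list_text):
    """Return the lines of a URL list that start with http:// or https://."""
    links = []
    for line in list_text.splitlines():
        if line.startswith(("http://", "https://")):
            links.append(line.rstrip())  # drop trailing whitespace only
    return links


def is_inline(content_disposition):
    """Return True if the header value is absent or its disposition type is 'inline'."""
    if not content_disposition:
        # Assumption: no Content-Disposition header means the document is shown inline.
        return True
    disposition_type = content_disposition.split(";", 1)[0].strip().lower()
    return disposition_type == "inline"
```

For example, a header value of attachment; filename="list.txt" would make is_inline return False, which suggests the list should be hosted elsewhere or served differently.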

Restrictions

Instagram

chromebot has been blacklisted by Instagram. When asked to archive any Instagram.com URL, chromebot responds with the following error:

<Instagram.com URL> cannot be queued: Banned by Instagram

Cloudflare DDoS protection

chromebot should be able to circumvent Cloudflare's DDoS protection, but scrolling and other behaviour may be disabled after the reload (issue #13 on GitHub).

People

PurpleSym maintains the software and scripts, pays the server bills, and has administrative access. katocala is a server administrator.