By default, the bot grabs only a single URL. It also supports recursion, which is rather slow because every page has to be loaded and rendered by a browser. A dashboard is available for watching the progress of such jobs.
|a <url> -j <concurrency> -r <policy>|Archive <url> with <concurrency> processes according to recursion <policy>.|
|s <uuid>|Get job status for <uuid>.|
|r <uuid>|Revoke or abort running job with <uuid>.|
Please note that the commands are case-sensitive.
URL lists can be archived using recursion, for example:
chromebot: a https://transfer.notkiska.pw/inline/UpfR/HollyConrad-tweets -r 1 -j 4
chromebot assumes that every line starting with http:// or https:// is a valid link. Note that the server must return the list as an *inline* document, not as a download (attachment).
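This line-based filtering can be sketched in a few lines of Python. This is an illustration of the rule described above, not chromebot's actual code; the function name `extract_links` is made up for this example.

```python
def extract_links(document: str) -> list[str]:
    # chromebot treats every line starting with http:// or https://
    # as a link to archive; all other lines are ignored.
    links = []
    for line in document.splitlines():
        line = line.strip()
        if line.startswith(("http://", "https://")):
            links.append(line)
    return links

# Example: only the two URL lines survive the filter.
text = "# tweet archive\nhttps://example.org/a\nnot a link\nhttp://example.org/b\n"
print(extract_links(text))  # ['https://example.org/a', 'http://example.org/b']
```

Because the filter is purely line-based, comments, blank lines, and any other text can be mixed freely into the list.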
chromebot has been blacklisted by Instagram. When asked to archive any instagram.com URL, chromebot responds with the following error:
<Instagram.com URL> cannot be queued: Banned by Instagram
Cloudflare DDoS protection
chromebot should be able to circumvent Cloudflare's DDoS protection, but scrolling and other behaviour may be disabled after the reload (issue #13 on GitHub).
In April 2021, it was discovered that the WARCs written by crocoite contained incorrect dates: revisit records received the date of the daily deduplication run instead of the retrieval date copied from the response record they replaced, misrepresenting when the identical capture was made. Furthermore, all records were presented as HTTP/1.1 with fabricated headers, even for resources fetched over HTTP/2 or other protocols supported by Chrome (e.g. WebSockets, HTTP/3). Because of these major data integrity problems, the bot's WARCs were removed from the Wayback Machine index and the bot was shut down indefinitely. The dates in the old revisit records likely cannot be fixed reliably because the log information is incomplete, so a reversal of the WBM exclusion is unlikely.
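The revisit-date bug is easiest to see in contrast with correct behaviour: the revisit record's WARC-Date must be copied from the response record it replaces, not taken from the clock at deduplication time. A minimal sketch using plain dicts (this is not crocoite's actual data model; `make_revisit_record` and the field subset are hypothetical):

```python
def make_revisit_record(response_record: dict) -> dict:
    """Build a revisit record replacing an identical response capture.

    The 2021 crocoite bug: the daily deduplication run stamped revisit
    records with the run's own date. Correct behaviour is to copy
    WARC-Date from the response record being replaced, so the revisit
    record still says *when* the duplicate content was retrieved.
    """
    return {
        "WARC-Type": "revisit",
        "WARC-Target-URI": response_record["WARC-Target-URI"],
        # Retrieval date of the duplicate capture, NOT the dedup run date.
        "WARC-Date": response_record["WARC-Date"],
        "WARC-Payload-Digest": response_record["WARC-Payload-Digest"],
    }

resp = {
    "WARC-Type": "response",
    "WARC-Target-URI": "https://example.org/",
    "WARC-Date": "2021-03-01T12:00:00Z",
    "WARC-Payload-Digest": "sha1:EXAMPLEDIGEST",
}
print(make_revisit_record(resp)["WARC-Date"])  # 2021-03-01T12:00:00Z
```

Fixing the existing records would require recovering each duplicate's original retrieval time, which is exactly the information the incomplete logs no longer provide.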