Chromebot

From Archiveteam
Revision as of 12:58, 22 May 2019

chromebot, a.k.a. crocoite, is an IRC bot parallel to ArchiveBot that uses Google Chrome and can therefore archive JavaScript-heavy and bottomless (infinitely scrolling) websites. Both the software and the bot are maintained by User:PurpleSymphony. WARCs are uploaded daily to the chromebot collection on archive.org.

By default the bot grabs only a single URL. However, it also supports recursion, which is rather slow, since every page must be loaded and rendered by a browser. A dashboard is available for watching the progress of such jobs.

Usage

See also: crocoite usage documentation on GitHub

You can call chromebot in the #archivebot IRC channel (on EFnet), which it shares with ArchiveBot. The bot answers to both “chromebot” and “chromebot:”; the trailing colon is optional.

Command                                          Description
chromebot: a <url> -r <policy> -j <concurrency>  Archive <url> with <concurrency> processes according to recursion <policy>.
chromebot: s <uuid>                              Get job status for <uuid>.
chromebot: r <uuid>                              Revoke or abort running job with <uuid>.

Please note that the commands are case-sensitive.
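The command grammar above can be sketched as a small parser. This is an illustrative assumption about the syntax shown in the table, not code taken from crocoite itself; the default concurrency of 1 is likewise an assumption.

```python
import re

def parse_archive_command(message):
    """Parse an 'a' (archive) command as described in the table above.

    Returns a dict with url, policy, and concurrency, or None if the
    message is not an archive command. Names and defaults here are
    illustrative, not crocoite's actual internals.
    """
    m = re.match(
        r"chromebot:?\s+a\s+(?P<url>\S+)"      # 'chromebot' or 'chromebot:'
        r"(?:\s+-r\s+(?P<policy>\S+))?"        # optional recursion policy
        r"(?:\s+-j\s+(?P<concurrency>\d+))?$", # optional concurrency
        message,
    )
    if m is None:
        return None
    return {
        "url": m.group("url"),
        "policy": m.group("policy"),
        "concurrency": int(m.group("concurrency") or 1),
    }
```

For example, `parse_archive_command("chromebot: a https://example.com -r 1 -j 4")` yields the URL with policy `"1"` and concurrency `4`, while a bare `chromebot a <url>` falls back to the defaults.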

URL lists can be archived using recursion, for example:

chromebot: a https://transfer.notkiska.pw/inline/UpfR/HollyConrad-tweets -r 1 -j 4

chromebot will assume that all lines starting with http(s):// are valid links. Note that the list itself must be returned by the server as an *inline* document (Content-Disposition: inline), not as a download (attachment).
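The line-filtering rule described above can be sketched as follows. This is a minimal model of the stated behaviour (keep only lines beginning with http:// or https://), assumed from the description rather than taken from crocoite's source.

```python
def extract_links(document_text):
    """Keep only lines that start with http:// or https://,
    mirroring the rule described above."""
    links = []
    for line in document_text.splitlines():
        line = line.strip()
        if line.startswith("http://") or line.startswith("https://"):
            links.append(line)
    return links
```

Any line that does not begin with a scheme (comments, blank lines, bare hostnames) would simply be ignored under this rule.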

Restrictions

Instagram

chromebot has been blacklisted by Instagram. When trying to archive any Instagram.com website, chromebot responds with the following error:

<Instagram.com URL> cannot be queued: Banned by Instagram

Cloudflare DDoS protection

chromebot should be able to circumvent Cloudflare's DDoS protection, but scrolling and other behaviour may be disabled after the reload (issue #13 on GitHub).