Difference between revisions of "Chromebot"

From Archiveteam
m (Mentioning bottomless websites (also known as “infinite scroll”).)
(Added mention of WBM exclusion to lede)
 
(17 intermediate revisions by 5 users not shown)
Line 1: Line 1:
chromebot is an [[IRC]] bot parallel to [[ArchiveBot]] that uses Google Chrome and thus is able to archive JavaScript-heavy and bottomless websites. Both the [https://github.com/PromyLOPh/crocoite software] and the bot are maintained by [[User:PurpleSymphony]]. WARCs are uploaded daily to the [https://archive.org/details/archiveteam_chromebot?sort=-publicdate chromebot collection] on archive.org.
'''chromebot''' (also known as '''crocoite''') was an [[IRC]] bot, parallel to [[ArchiveBot]], that used Google Chrome and was therefore able to archive JavaScript-heavy and bottomless ("infinite scroll") websites. On 2021-04-21 the bot was shut down (see [[#Shutdown]]), and the captures it made are no longer in the Wayback Machine. [[WARC]]s were uploaded twice a day to the [https://archive.org/details/archiveteam_chromebot?sort=-publicdate chromebot collection] on archive.org. For a given item in the collection, you can see which URLs are saved in the WARC by looking at the associated jobs.json.gz file.


By default the bot only grabs a single URL. However it supports recursion, which is rather slow, since every single page needs to be loaded and rendered by a browser. A [https://6xq.net/chromebot/ dashboard] is available for watching the progress of such jobs.
By default, the bot grabs only a single URL. However, it supports recursion, which is rather slow because every page has to be loaded and rendered by a browser. A [http://chromebot.6xq.net/ dashboard] is available for watching the progress of such jobs.


== Usage<ref name=usage>[https://github.com/PromyLOPh/crocoite/blob/184189f0a535996edca01a68182ed07d32e26e9c/README.rst#IRC-bot ChromeBot usage documentation on GitHub]</ref> ==
== Usage ==
You can call ''chromebot'' on the {{IRC|archivebot}} IRC channel, which chromebot shares with its parent [[ArchiveBot]]. Both “<code>chromebot</code>” and “<code>chromebot:</code>” work, with or without the colon. The username can be autocompleted using the “<kbd>↹</kbd>Tab” key in the EFNet web chat interface or IRC client.
[https://6xq.net/crocoite/usage/ crocoite usage documentation]


{| class="wikitable"
{| class="wikitable"
|-
|-
! Command !! Description
! Command !! Description
|-
| white-space: nowrap |
<code>chromebot: a <url> -r <policy> -j <concurrency></code>
|| Archive ''<url>'' with ''<concurrency>'' processes according to recursion ''<policy>''.
|-
|-
| <code>chromebot: a <uuid></code><br /><code>chromebot a <uuid></code> || Archive <url> with <concurrency> processes according to recursion <policy>.
| <code>chromebot: s <uuid></code> || Get job status for ''<uuid>''.
|-
|-
| <code>chromebot: s <uuid></code><br /><code>chromebot s <uuid></code> ||    Get job status for <uuid>.
| <code>chromebot: r <uuid></code> || Revoke or abort running job with ''<uuid>''.
|-
| <code>chromebot: r <uuid></code><br /><code>chromebot r <uuid></code> || Revoke or abort running job with <uuid>.
|}
|}


Please note that the commands are case-sensitive.
Please note that the commands are case-sensitive.
URL lists can be archived using recursion, for example:
<code>chromebot: a https://transfer.notkiska.pw/inline/UpfR/HollyConrad-tweets -r 1 -j 4</code>
chromebot will assume all lines starting with http(s):// are valid links. Note that the list itself must be returned by the server as an *inline* document, not as a download (attachment).


== Restrictions ==
== Restrictions ==
=== Instagram.com ===
=== Instagram ===
ChromeBot has been blacklisted by [[Instagram]], a website infamous for being an archival loophole.
chromebot has been blacklisted by [[Instagram]]. When trying to archive any Instagram.com website, chromebot responds with the following error:
''<Instagram.com URL> cannot be queued: Banned by Instagram''


When trying to archive any Instagram.com website, chromebot responds with the following error:
=== Cloudflare DDoS protection ===
''<Instagram.com URL> cannot be queued: Banned by Instagram''
chromebot should be able to circumvent Cloudflare's DDoS protection, but scrolling and other behaviour may be disabled after the reload ([https://github.com/PromyLOPh/crocoite/issues/13 issue #13 on GitHub]).
 
== People ==


One way to bypass Instagram's restrictions partially is using [http://Insta-Stalker.com/ Insta-Stalker.com], which is just a third-party web viewer for Instagram, equipped with an AJAX-free user search feature and the ability to view profiles without Instagram's new Web-App-type website (similar to [https://mobile.twitter.com/ Twitter Lite]) that made Instagram inaccessible to the [[Wayback Machine]] and [[Archive.Today]]'s crawlers. The former gets stuck in an infinite refresh loop.
[[User:PurpleSymphony|PurpleSym]] maintains the [https://github.com/PromyLOPh/crocoite software] and [https://github.com/PromyLOPh/chromebot scripts], pays the server bills, and has administrative access. katocala is a server administrator.


'''URL format:'''
== Shutdown ==
* Search URL: https://insta-stalker.com/search/?q=<code>Search+Term+here</code>
In April 2021, it was discovered that the WARCs written by crocoite had incorrect dates. Specifically, revisit records received the date of the daily deduplication run rather than the date of retrieval copied from the response record they replaced, which misrepresents when the identical capture was found. Further, all records were presented as HTTP/1.1 with made-up headers, even those retrieved over HTTP/2 or any other protocol supported by Chrome (e.g. WebSockets, HTTP/3). These major data integrity problems led to the bot's WARCs being removed from the Wayback Machine index and to the bot being shut down indefinitely. The dates of the old revisit records likely cannot be fixed reliably because the log information is incomplete, so a reversal of the WBM exclusion is unlikely.
* User URL (example): https://insta-stalker.com/profile/SamsungMobile/


== References ==
[[Category:Bots]]
<references />

Latest revision as of 13:46, 17 October 2021

chromebot (also known as crocoite) was an IRC bot, parallel to ArchiveBot, that used Google Chrome and was therefore able to archive JavaScript-heavy and bottomless ("infinite scroll") websites. On 2021-04-21 the bot was shut down (see #Shutdown), and the captures it made are no longer in the Wayback Machine. WARCs were uploaded twice a day to the chromebot collection on archive.org. For a given item in the collection, you can see which URLs are saved in the WARC by looking at the associated jobs.json.gz file.

By default, the bot grabs only a single URL. However, it supports recursion, which is rather slow because every page has to be loaded and rendered by a browser. A dashboard is available for watching the progress of such jobs.

Usage

crocoite usage documentation

Command | Description
chromebot: a <url> -r <policy> -j <concurrency> | Archive <url> with <concurrency> processes according to recursion <policy>.
chromebot: s <uuid> | Get job status for <uuid>.
chromebot: r <uuid> | Revoke or abort running job with <uuid>.

Please note that the commands are case-sensitive.
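
The command grammar in the table can be illustrated with a small parser. This is only a sketch and not crocoite's actual implementation: the names BotCommand and parse_command are hypothetical, and it assumes that -r and -j are optional (since the bot grabs a single URL by default) and that the concurrency is a plain integer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BotCommand:
    action: str                        # "a" (archive), "s" (status) or "r" (revoke)
    target: str                        # URL for "a", job UUID for "s" and "r"
    policy: Optional[str] = None       # value of -r, only for "a"
    concurrency: Optional[int] = None  # value of -j, only for "a"


def parse_command(line: str, nick: str = "chromebot") -> Optional[BotCommand]:
    """Parse one IRC line; return None if it is not a valid command."""
    words = line.split()
    if len(words) < 3 or words[0] != nick + ":":
        return None
    action, target, rest = words[1], words[2], words[3:]
    # Commands are case-sensitive, so "A" or "S" are rejected here.
    if action not in ("a", "s", "r"):
        return None
    command = BotCommand(action, target)
    if action != "a":
        # Status and revoke take exactly one argument.
        return command if not rest else None
    # Options come in flag/value pairs (assumption: both are optional).
    if len(rest) % 2 != 0:
        return None
    for flag, value in zip(rest[0::2], rest[1::2]):
        if flag == "-r":
            command.policy = value
        elif flag == "-j" and value.isdigit():
            command.concurrency = int(value)
        else:
            return None
    return command
```

For example, parse_command("chromebot: a https://example.org/ -r 1 -j 4") yields an archive command with policy "1" and concurrency 4, while the same line with an upper-case "A" is rejected.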

URL lists can be archived using recursion, for example:

chromebot: a https://transfer.notkiska.pw/inline/UpfR/HollyConrad-tweets -r 1 -j 4

chromebot will assume all lines starting with http(s):// are valid links. Note that the list itself must be returned by the server as an *inline* document, not as a download (attachment).
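
The two rules above (only lines starting with http(s):// count as links, and the list must be served inline) can be sketched as follows. This is an illustration, not crocoite's code: extract_links and is_inline are hypothetical helpers, and the treatment of the Content-Disposition header follows the usual HTTP convention that a missing header means inline.

```python
from typing import List, Optional


def extract_links(text: str) -> List[str]:
    """Return every line of a URL list that starts with http:// or https://."""
    return [
        line.rstrip()
        for line in text.splitlines()
        if line.startswith(("http://", "https://"))
    ]


def is_inline(content_disposition: Optional[str]) -> bool:
    """Check whether a response would be displayed inline rather than downloaded.

    content_disposition is the raw Content-Disposition header value,
    or None if the server did not send the header.
    """
    if content_disposition is None:
        return True  # no header: the document is shown inline
    disposition = content_disposition.split(";", 1)[0].strip().lower()
    return disposition == "inline"
```

Before submitting a list, it may be worth checking the response headers of the list URL by hand; a server that answers with "Content-Disposition: attachment" would not satisfy the requirement described above.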

Restrictions

Instagram

chromebot has been blacklisted by Instagram. When trying to archive any Instagram.com website, chromebot responds with the following error:

<Instagram.com URL> cannot be queued: Banned by Instagram

Cloudflare DDoS protection

chromebot should be able to circumvent Cloudflare's DDoS protection, but scrolling and other behaviour may be disabled after the reload (issue #13 on GitHub).

People

PurpleSym maintains the software and scripts, pays the server bills, and has administrative access. katocala is a server administrator.

Shutdown

In April 2021, it was discovered that the WARCs written by crocoite had incorrect dates. Specifically, revisit records received the date of the daily deduplication run rather than the date of retrieval copied from the response record they replaced, which misrepresents when the identical capture was found. Further, all records were presented as HTTP/1.1 with made-up headers, even those retrieved over HTTP/2 or any other protocol supported by Chrome (e.g. WebSockets, HTTP/3). These major data integrity problems led to the bot's WARCs being removed from the Wayback Machine index and to the bot being shut down indefinitely. The dates of the old revisit records likely cannot be fixed reliably because the log information is incomplete, so a reversal of the WBM exclusion is unlikely.
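
The date problem can be made concrete with a simplified model of deduplication. The sketch below is not crocoite's code and does not read or write real WARC files; Record and deduplicate are hypothetical names, and records are reduced to four fields. It shows the behaviour described above as correct: a revisit record keeps the retrieval date of the response record it replaces, and the time of the deduplication run is never used.

```python
from dataclasses import dataclass, replace
from typing import List, Set


@dataclass(frozen=True)
class Record:
    record_type: str     # "response" or "revisit"
    target_uri: str
    date: str            # time of retrieval, ISO 8601
    payload_digest: str  # digest of the response payload


def deduplicate(records: List[Record]) -> List[Record]:
    """Replace repeated payloads with revisit records, preserving dates."""
    seen: Set[str] = set()
    result: List[Record] = []
    for record in records:
        if record.record_type == "response" and record.payload_digest in seen:
            # Only the record type changes; the date stays the retrieval
            # date of the replaced response record.
            result.append(replace(record, record_type="revisit"))
            continue
        if record.record_type == "response":
            seen.add(record.payload_digest)
        result.append(record)
    return result
```

With two captures of the same payload on different days, the second one becomes a revisit record that still carries its own retrieval date, regardless of when deduplicate is run.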