Difference between revisions of "Chromebot"

From Archiveteam
m (→‎UsageChromeBot usage documentation on GitHub: Rearrangement. Looks less distorted when reading commands in table.)
(Remove the irrelevant noise)
Line 1: Line 1:
chromebot is an [[IRC]] bot parallel to [[ArchiveBot]] that uses Google Chrome and thus is able to archive JavaScript-heavy and bottomless websites. Both, [https://github.com/PromyLOPh/crocoite software] and bot, are maintained by [[User:PurpleSymphony]]. [[WARC]]s are uploaded daily to the [https://archive.org/details/archiveteam_chromebot?sort=-publicdate chromebot collection] on archive.org.
'''chromebot''' aka. '''crocoite''' is an [[IRC]] bot parallel to [[ArchiveBot]] that uses Google Chrome and thus is able to archive JavaScript-heavy and bottomless websites. Both, [https://github.com/PromyLOPh/crocoite software] and bot, are maintained by [[User:PurpleSymphony]]. [[WARC]]s are uploaded daily to the [https://archive.org/details/archiveteam_chromebot?sort=-publicdate chromebot collection] on archive.org.


By default the bot only grabs a single URL. However it supports recursion, which is rather slow, since every single page needs to be loaded and rendered by a browser. A [https://6xq.net/chromebot/ dashboard] is available for watching the progress of such jobs.
By default the bot only grabs a single URL. However it supports recursion, which is rather slow, since every single page needs to be loaded and rendered by a browser. A [https://6xq.net/chromebot/ dashboard] is available for watching the progress of such jobs.


== Usage<ref name=usage>[https://github.com/PromyLOPh/crocoite/blob/184189f0a535996edca01a68182ed07d32e26e9c/README.rst#IRC-bot ChromeBot usage documentation on GitHub]</ref> ==
== Usage ==
You can call ''chromebot'' on the {{IRC|archivebot}} IRC channel, which chromebot shares with it's parent [[ArchiveBot]]. Both “<code>chromebot</code>” and “<code>chromebot:</code>” work, with or without the colon. The username can be autocompleted using the “<kbd>↹</kbd>Tab” key in the EFNet web chat interface or IRC client.
[https://github.com/PromyLOPh/crocoite/blob/184189f0a535996edca01a68182ed07d32e26e9c/README.rst#IRC-bot crocoite usage documentation on GitHub]
 
You can call ''chromebot'' on the {{IRC|archivebot}} IRC channel, which chromebot shares with [[ArchiveBot]]. Both “<code>chromebot</code>” and “<code>chromebot:</code>” work, with or without the colon.


{| class="wikitable"
{| class="wikitable"
Line 20: Line 22:
|| Archive ''<url>'' with ''<concurrency>'' processes according to recursion ''<policy>''.
|| Archive ''<url>'' with ''<concurrency>'' processes according to recursion ''<policy>''.
|-
|-
| <code>chromebot: s <uuid></code><br /><code>chromebot s <uuid></code> ||     Get job status for ''<uuid>''.
| <code>chromebot: s <uuid></code><br /><code>chromebot s <uuid></code> || Get job status for ''<uuid>''.
|-
|-
| <code>chromebot: r <uuid></code><br /><code>chromebot r <uuid></code> || Revoke or abort running job with ''<uuid>''.
| <code>chromebot: r <uuid></code><br /><code>chromebot r <uuid></code> || Revoke or abort running job with ''<uuid>''.
Line 29: Line 31:
== Restrictions ==
== Restrictions ==
=== Instagram.com ===
=== Instagram.com ===
ChromeBot has been blacklisted by [[Instagram]], a website infamous for being an archival loophole.
chromebot has been blacklisted by [[Instagram]]. When trying to archive any Instagram.com website, chromebot responds with the following error:
 
When trying to archive any Instagram.com website, chromebot responds with the following error:
  ''<Instagram.com URL> cannot be queued: Banned by Instagram''
  ''<Instagram.com URL> cannot be queued: Banned by Instagram''


One way to bypass Instagram's restrictions partially is using [http://Insta-Stalker.com/ Insta-Stalker.com], which is just a third-party web viewer for Instagram, equipped with an AJAX-free user search feature and the ability to view profiles without Instagram's new Web-App-type website (similar to [https://mobile.twitter.com/ Twitter Lite]) that made Instagram inaccessible to the [[Wayback Machine]] and [[Archive.Today]]'s crawlers. The former gets stuck in an infinite refresh loop.
=== Cloudflare DDoS protection ===
 
chromebot should be able to circumvent Cloudflare's DDoS protection, but scrolling and other behaviour may be disabled after the reload ([https://github.com/PromyLOPh/crocoite/issues/13 issue #13 on GitHub]).
'''URL format:'''
* Search URL: https://insta-stalker.com/search/?q=<code>Search+Term+here</code>
* User URL (example): https://insta-stalker.com/profile/SamsungMobile/
 
A way to bypass Instagram's restriction using [[ArchiveBot]], which is not blocked from Instagram, is using the ''[[snscrape]]'' tool to put the URLs of the posts into a text file list that, uploaded to https://transfer.sh/ or https://transfer.notkiska.pw/ , that can be consumed by ArchiveBot's <code>!ao < <link to list file></code> command.<br />Pages captured from Instagram store the information, but can not be viewed in the version injected into the Wayback Machine, which gets stuck in an infinite refresh loop due to Instagram's heavy usage of JavaScript (web-app type).
 
 
=== CloudFlare DDoS protection ===
Another obstacle for both this bot and [[ArchiveBot]] is CloudFlare's DDoS protection, which could prevent the bots from capturing a webpage.


== References ==
== References ==
<references />
<references />


[[Category:Bots]]
[[Category:Bots]]

Revision as of 22:53, 9 May 2019

chromebot, also known as crocoite, is an IRC bot that runs alongside ArchiveBot. It uses Google Chrome and is therefore able to archive JavaScript-heavy and bottomless websites. Both the software and the bot are maintained by User:PurpleSymphony. WARCs are uploaded daily to the chromebot collection on archive.org.

By default, the bot grabs only a single URL. It also supports recursion, but this is rather slow because every page has to be loaded and rendered by a browser. A dashboard is available for watching the progress of such jobs.

Usage

crocoite usage documentation on GitHub

You can call chromebot on the #archivebot (on hackint) IRC channel, which chromebot shares with ArchiveBot. The bot can be addressed with or without a trailing colon: both “chromebot” and “chromebot:” work.

Commands and their descriptions:

chromebot: a <url>
chromebot: a <url> <concurrency>
chromebot: a <url> <concurrency> <policy>
chromebot a <url>
chromebot a <url> <concurrency>
chromebot a <url> <concurrency> <policy>
    Archive <url> with <concurrency> processes according to recursion <policy>.

chromebot: s <uuid>
chromebot s <uuid>
    Get job status for <uuid>.

chromebot: r <uuid>
chromebot r <uuid>
    Revoke or abort running job with <uuid>.

Please note that the commands are case-sensitive.
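
The command grammar above can be illustrated with a small parser. This is only a sketch in Python and not crocoite's actual implementation: the names Command and parse_message are hypothetical, and treating <concurrency> as a decimal integer is an assumption based on the phrase "<concurrency> processes". The recursion policy is passed through as an opaque string, because the valid policy names are not listed here.

```python
from typing import NamedTuple, Optional


class Command(NamedTuple):
    """One parsed bot command (hypothetical structure, for illustration only)."""
    action: str                        # "a", "s" or "r"
    target: str                        # URL for "a", job UUID for "s" and "r"
    concurrency: Optional[int] = None  # assumed to be a decimal integer
    policy: Optional[str] = None       # opaque; valid names are not listed here


def parse_message(message: str, nick: str = "chromebot") -> Optional[Command]:
    """Return a Command if the message addresses the bot correctly, else None."""
    words = message.split()
    # The bot can be addressed with or without a trailing colon.
    if len(words) < 3 or words[0] not in (nick, nick + ":"):
        return None
    action, args = words[1], words[2:]
    # Commands are case-sensitive, so "A", "S" and "R" are rejected.
    if action == "a" and len(args) <= 3:
        concurrency = None
        if len(args) >= 2:
            if not args[1].isdecimal():
                return None
            concurrency = int(args[1])
        policy = args[2] if len(args) == 3 else None
        return Command("a", args[0], concurrency, policy)
    if action in ("s", "r") and len(args) == 1:
        return Command(action, args[0])
    return None
```

With this sketch, parse_message("chromebot: a https://example.com/") and parse_message("chromebot a https://example.com/") produce the same Command, while parse_message("chromebot: A https://example.com/") returns None because the upper-case "A" is not a recognised command.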

Restrictions

Instagram.com

chromebot has been blacklisted by Instagram. When asked to archive any Instagram.com URL, chromebot responds with the following error:

<Instagram.com URL> cannot be queued: Banned by Instagram

Cloudflare DDoS protection

chromebot should be able to circumvent Cloudflare's DDoS protection, but scrolling and other behaviour may be disabled after the reload (issue #13 on GitHub).
