Chromebot

From Archiveteam

chromebot is an IRC bot parallel to ArchiveBot that uses Google Chrome and is thus able to archive JavaScript-heavy and bottomless (infinite-scroll) websites. Both the software and the bot are maintained by User:PurpleSymphony. WARCs are uploaded daily to the chromebot collection on archive.org.

By default, the bot grabs only a single URL. It also supports recursion, but this is rather slow because every single page has to be loaded and rendered by a browser. A dashboard is available for watching the progress of such jobs.

Usage[1]

You can call chromebot in the #archivebot IRC channel (on hackint), which chromebot shares with its parent ArchiveBot. The bot can be addressed as either “chromebot” or “chromebot:”, that is, with or without the trailing colon. The username can be autocompleted with the “Tab” key in the web chat interface or in your IRC client.

Command
    Description

chromebot: a <url>
chromebot a <url>
chromebot: a <url> <concurrency>
chromebot a <url> <concurrency>
chromebot: a <url> <concurrency> <policy>
chromebot a <url> <concurrency> <policy>
    Archive <url> with <concurrency> processes according to recursion <policy>.

chromebot: s <uuid>
chromebot s <uuid>
    Get job status for <uuid>.

chromebot: r <uuid>
chromebot r <uuid>
    Revoke or abort running job with <uuid>.

Please note that the commands are case-sensitive.
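
The command forms listed above can be assembled programmatically. The following is a minimal sketch, not part of chromebot itself: the helper names build_archive_command and build_job_command are hypothetical, and the sketch assumes that <policy> can only be given together with <concurrency>, since the listed forms are positional. Valid values for <policy> are not documented here, so the helper passes whatever string it is given.

```python
from typing import Optional

BOT_NICK = "chromebot"  # nick as used in the command list; commands are case-sensitive


def _prefix(with_colon: bool) -> str:
    # Both "chromebot" and "chromebot:" are accepted by the bot.
    return BOT_NICK + (":" if with_colon else "")


def build_archive_command(url: str,
                          concurrency: Optional[int] = None,
                          policy: Optional[str] = None,
                          with_colon: bool = True) -> str:
    """Build an 'a' (archive) message (hypothetical helper)."""
    # Assumption: the arguments are positional, so <policy> needs <concurrency>.
    if policy is not None and concurrency is None:
        raise ValueError("<policy> requires <concurrency> to be given as well")
    parts = [_prefix(with_colon), "a", url]
    if concurrency is not None:
        parts.append(str(concurrency))
    if policy is not None:
        parts.append(policy)
    return " ".join(parts)


def build_job_command(action: str, uuid: str, with_colon: bool = True) -> str:
    """Build an 's' (status) or 'r' (revoke) message (hypothetical helper)."""
    # Lowercase only: "S" or "R" would not match the case-sensitive commands.
    if action not in ("s", "r"):
        raise ValueError("action must be 's' (status) or 'r' (revoke)")
    return " ".join([_prefix(with_colon), action, uuid])
```

For example, build_archive_command("https://example.com/") yields the single-URL form, and build_job_command("s", "<uuid>") yields the status query for a job.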

Restrictions

Instagram.com

chromebot has been blacklisted by Instagram, a website notorious for being difficult to archive.

When asked to archive any Instagram.com URL, chromebot responds with the following error:

<Instagram.com URL> cannot be queued: Banned by Instagram
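
To avoid submitting URLs that will be rejected with this error, a client-side check can be run first. This is only a sketch: the function name is_instagram_url is hypothetical, and the rule chromebot actually uses to recognise Instagram URLs is not documented here, so the sketch assumes a simple match on the host name.

```python
from urllib.parse import urlsplit


def is_instagram_url(url: str) -> bool:
    """Return True if the URL's host is instagram.com or a subdomain of it.

    Assumption: a host-name match approximates the bot's own check.
    URLs without a scheme (e.g. "instagram.com/foo") have no parsed host
    and are reported as False.
    """
    host = (urlsplit(url).hostname or "").lower()
    return host == "instagram.com" or host.endswith(".instagram.com")
```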

One way to partially bypass Instagram's restrictions is Insta-Stalker.com, a third-party web viewer for Instagram. It offers an AJAX-free user search feature and allows profiles to be viewed without Instagram's new web-app-type website (similar to Twitter Lite), which made Instagram inaccessible to the crawlers of the Wayback Machine and Archive.Today. The Wayback Machine's crawler gets stuck in an infinite refresh loop.

URL format:

Another way around the restriction is ArchiveBot, which is not blocked by Instagram. Use the snscrape tool to collect the URLs of the posts into a text file, upload that file to https://transfer.sh/ or https://transfer.notkiska.pw/ , and pass the resulting link to ArchiveBot's !ao < <link to list file> command.
Pages captured from Instagram this way do store the information, but they cannot be viewed in the version injected into the Wayback Machine, which gets stuck in an infinite refresh loop due to Instagram's heavy use of JavaScript (web-app type).
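
The list-file step of this workflow can be sketched as follows. The sketch deliberately does not call snscrape or any upload service; it assumes the post URLs have already been obtained as plain strings (for example, one per line of the scraper's output). The function names write_url_list and archivebot_list_command are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def write_url_list(urls: Iterable[str], path: str) -> List[str]:
    """Write unique, non-empty URLs to `path`, one per line, keeping order."""
    seen = set()
    cleaned: List[str] = []
    for raw in urls:
        url = raw.strip()
        # Skip blank lines and duplicates so each URL is listed only once.
        if url and url not in seen:
            seen.add(url)
            cleaned.append(url)
    Path(path).write_text("".join(u + "\n" for u in cleaned), encoding="utf-8")
    return cleaned


def archivebot_list_command(list_url: str) -> str:
    """Format the ArchiveBot command for an already uploaded list file."""
    return f"!ao < {list_url}"
```

After the file has been uploaded manually, archivebot_list_command(<link to list file>) gives the message to post in the channel.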


Cloudflare DDoS protection

Another obstacle for both this bot and ArchiveBot is Cloudflare's DDoS protection, which can prevent the bots from capturing a webpage.

References