Difference between revisions of "Chromebot"

From Archiveteam
m (Adding “white-space: nowrap;” CSS property to avoid linebreaks in the table for the “command” column. See: #usage.)
Line 9: Line 9:
|-
|-
! Command !! Description
! Command !! Description
|-
|-  
| <code>chromebot: a <uuid></code><br /><code>chromebot a <uuid></code>  || Archive <url> with <concurrency> processes according to recursion <policy>.
| white-space: nowrap | <code>chromebot: a <uuid></code><br /><code>chromebot a <uuid></code>  || Archive <url> with <concurrency> processes according to recursion <policy>.
|-
|-
| <code>chromebot: s <uuid></code><br /><code>chromebot s <uuid></code> ||    Get job status for <uuid>.
| <code>chromebot: s <uuid></code><br /><code>chromebot s <uuid></code> ||    Get job status for <uuid>.

Revision as of 18:39, 30 April 2019

chromebot is an IRC bot that runs alongside ArchiveBot. Because it uses Google Chrome, it can archive JavaScript-heavy and bottomless (infinitely scrolling) websites. Both the software and the bot are maintained by User:PurpleSymphony. WARCs are uploaded daily to the chromebot collection on archive.org.

By default, the bot grabs only a single URL. It also supports recursion, which is rather slow because every page has to be loaded and rendered by a browser. A dashboard is available for watching the progress of such jobs.

Usage[1]

You can call chromebot on the #archivebot (on hackint) IRC channel, which chromebot shares with its parent, ArchiveBot. The bot responds to both “chromebot” and “chromebot:”, that is, with or without the colon. The username can be autocompleted with the “Tab” key in the EFNet web chat interface or in an IRC client.

Command                                    Description
chromebot: a <url>, chromebot a <url>      Archive <url> with <concurrency> processes according to recursion <policy>.
chromebot: s <uuid>, chromebot s <uuid>    Get job status for <uuid>.
chromebot: r <uuid>, chromebot r <uuid>    Revoke or abort running job with <uuid>.

Please note that the commands are case-sensitive.
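
As a rough illustration of the command syntax above, the following Python sketch builds the message text for each command. The function names are hypothetical and not part of chromebot. It assumes that the archive command takes the URL to archive as its argument (as the description suggests) and that job IDs are standard UUID strings. How <concurrency> and <policy> are passed is not shown on this page, so they are left out.

```python
import uuid

BOT_NICK = "chromebot"  # nick as used in the command table


def _prefix(colon: bool) -> str:
    # Both "chromebot" and "chromebot:" are accepted by the bot.
    return BOT_NICK + (":" if colon else "")


def _check_uuid(job_id: str) -> str:
    # Assumption: job IDs are standard UUIDs. uuid.UUID raises
    # ValueError for malformed input; the original string is returned.
    uuid.UUID(job_id)
    return job_id


def archive_command(url: str, colon: bool = True) -> str:
    # Assumption: the "a" command takes the URL to archive.
    return f"{_prefix(colon)} a {url}"


def status_command(job_id: str, colon: bool = True) -> str:
    return f"{_prefix(colon)} s {_check_uuid(job_id)}"


def revoke_command(job_id: str, colon: bool = True) -> str:
    return f"{_prefix(colon)} r {_check_uuid(job_id)}"
```

The command letters are kept in lower case because the commands are case-sensitive. The resulting strings are meant to be typed or sent as ordinary channel messages.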

Restrictions

Instagram.com

chromebot has been blacklisted by Instagram, a website notorious for being difficult to archive.

When asked to archive any Instagram.com URL, chromebot responds with the following error:

<Instagram.com URL> cannot be queued: Banned by Instagram
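
To avoid submitting a URL that will be rejected anyway, a client-side check could look roughly like the sketch below. The function name is hypothetical, and the rule chromebot itself uses to recognise Instagram URLs is not documented here, so matching on the host name is an assumption.

```python
from urllib.parse import urlsplit


def is_instagram_url(url: str) -> bool:
    # Assumption: a URL counts as Instagram if its host is
    # instagram.com or any subdomain of it.
    host = urlsplit(url).hostname  # lower-cased, or None if absent
    if host is None:
        # URLs without a scheme (e.g. "instagram.com/x") have no
        # parsed host and are treated as non-matching here.
        return False
    return host == "instagram.com" or host.endswith(".instagram.com")
```

For example, `is_instagram_url("https://www.instagram.com/someuser/")` is true, while a URL that merely contains "instagram.com" in its path is not matched.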

One way to partially bypass Instagram's restrictions is to use Insta-Stalker.com, a third-party web viewer for Instagram. It offers an AJAX-free user search and lets profiles be viewed without Instagram's new web-app-style website (similar to Twitter Lite), which made Instagram inaccessible to the crawlers of the Wayback Machine and Archive.Today. The Wayback Machine's crawler gets stuck in an infinite refresh loop.

URL format:

References