User:TheTechRobo/Mnbot

Documentation is a work-in-progress.

Background information

Pipelines

mnbot is designed to be scalable. More pipelines can easily be added, and while I haven't yet tested it, a single pipeline can be configured to process multiple items in parallel.

Assuming Docker is already installed on the host, a pipeline requires roughly 1.5 GB of disk space to build all the containers. As for temporary space, I don't yet have a good grasp of how much it needs, but more is obviously better :-)

The Queue

Items are dequeued by priority (lower values run first). Items with the same priority are dequeued in FIFO order.

Effective priority (what is actually used for computing order) is defined as priority + tries, so each retry pushes an item further back in the queue. Jobs currently get tried up to 3 times before failing.
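
The ordering rule above can be sketched as follows. This is only an illustration, not mnbot's actual code: the field names (priority, tries, enqueuedAt) are made up, tries is assumed to start at 0, and FIFO is assumed to break ties between items with the same effective priority.

```javascript
// Illustrative sketch only; field names are assumptions, not mnbot's schema.

// Effective priority as described above: priority + tries.
function effectivePriority(item) {
  return item.priority + item.tries;
}

// Lower effective priority runs first; ties fall back to FIFO
// (assumed here to mean the earliest enqueuedAt wins).
function dequeueOrder(items) {
  return [...items].sort(
    (a, b) =>
      effectivePriority(a) - effectivePriority(b) ||
      a.enqueuedAt - b.enqueuedAt
  );
}

const queue = [
  { id: "a", priority: 2, tries: 0, enqueuedAt: 1 },
  { id: "b", priority: 0, tries: 1, enqueuedAt: 2 }, // retried once
  { id: "c", priority: 1, tries: 0, enqueuedAt: 3 },
  { id: "d", priority: 0, tries: 0, enqueuedAt: 4 },
];

const order = dequeueOrder(queue).map((item) => item.id);
console.log(order); // [ 'd', 'b', 'c', 'a' ]
```

Note how item b, which was queued with priority 0 but has already been tried once, now ties with the priority 1 item c and only runs ahead of it because it was enqueued earlier.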

Using the bot

!brozzle, !b

Brozzles a page. Recursion is not yet supported.

Requires one of +v or +o. +v is provided to all users registered with NickServ.

Possible options:

  • --custom-js: Operators (+o) only. URL to a custom JavaScript file to run during the crawl. See the Custom JavaScript section below.
  • --explanation/-e: Saves an explanation in the database for the crawl. This can be changed later with the !explain command.
  • --user-agent/-u: Sets user agent header behaviour. Can take one of the following values:
    • default: Acts like Chrome on Windows, and appends (mnbot VERSION; +https://wiki.archiveteam.org/index.php/User:TheTechRobo/Chromebot).
    • stealth: Acts like Chrome on Windows, but doesn't add information about mnbot.
    • minimal: Sets the user agent to mnbot VERSION (+https://wiki.archiveteam.org/index.php/User:TheTechRobo/Chromebot)
    • curl: Sets the user agent to that of curl.
    • archivebot: Sets the user agent to one like ArchiveBot.
    • googlebot: Sets the user agent to one like Googlebot Desktop.
    • googlebot1: An older and outdated variation of Googlebot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

!explain, !e

Sets the explanation for an item.

Usage: !explain <IDENT> [EXPLANATION...]

!status

Forms: !status, !status [IDENTS...]

If IDENTS is provided, returns the status of each ident. Otherwise, returns queue status.

!limbo

Gets items in limbo (claimed for more than a day). If an item is here, it's likely the pipeline crashed, and you can probably use !!reclaim to reclaim the item (see Administration commands below).

!whereis, !w

Retrieves the last pipeline that claimed a certain job.

Custom JavaScript

Custom JavaScript behaviour can be added by using the --custom-js argument to !b.

The script is run after the screenshot is taken, but before brozzler is instructed to search for outlinks or visit anchors. The return value (the last line) of the script is saved in the WARC and available on the item viewer webpage (see below) -- useful for generating a list of URLs for ArchiveBot, for example. If you want to use the return value, it must be JSON-encodable or you will run into issues. The file must begin with //! mnbot v1 followed by a newline.

The script is run in REPL mode, which means top-level await and let redeclaration are both allowed. In effect, the script runs as if you had typed it into the browser console.
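
As a rough idea of what such a file could look like, here is a minimal sketch that collects the unique link targets on the page and returns them as a JSON-encodable array of strings. The helper name collectLinks and the choice of selector are illustrative only; the parts taken from the description above are the mandatory first line and the use of the last line as the return value.

```javascript
//! mnbot v1
// Hypothetical custom script: gather every unique link target on the page.
// collectLinks is an illustrative helper name, not part of mnbot.
const collectLinks = (doc) => [
  ...new Set(Array.from(doc.querySelectorAll("a[href]"), (a) => a.href)),
];

// The typeof guard only exists so the sketch can also be loaded outside a
// browser; inside the crawl, `document` is always defined.
const links = typeof document === "undefined" ? [] : collectLinks(document);

// Last line: this array of strings is the script's return value.
links
```

Returning plain strings rather than DOM nodes keeps the value JSON-encodable.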

Result format

JSON object:

  • status: success if mnbot believes the script ran successfully, exception if an exception occurred, unknown if mnbot could not determine the outcome
    • If the status is unknown, the only other field will be fullResult which contains the full result as sent from Chrome
  • exceptionDetails: if an exception occurred, a CDP ExceptionDetails object; otherwise, null
  • remoteObject: a CDP RemoteObject of the script's return value
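
A consumer of this object might branch on status along the following lines. This is a sketch, not part of mnbot: summarizeResult is a made-up helper, and it assumes the RemoteObject carries the standard CDP value or description field and the ExceptionDetails carries the standard CDP text field.

```javascript
// Hypothetical helper for reading a result object in the format listed above.
function summarizeResult(result) {
  switch (result.status) {
    case "success": {
      // A CDP RemoteObject only has `value` when it was returned by value;
      // otherwise fall back to its human-readable `description`.
      const { value, description } = result.remoteObject;
      return { ok: true, value: value ?? description };
    }
    case "exception":
      // CDP ExceptionDetails always includes a short `text` field.
      return { ok: false, error: result.exceptionDetails.text };
    default:
      // status "unknown": the only other field is fullResult.
      return { ok: false, raw: result.fullResult };
  }
}

// Example input shaped like the format above (the values are invented).
const example = {
  status: "success",
  exceptionDetails: null,
  remoteObject: { type: "object", subtype: "array", value: ["https://example.com/"] },
};
console.log(summarizeResult(example));
```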

Administration commands

These commands require +o to use. Some of them are dangerous! (Not so much right now, but maybe in the future.) Please make sure you aren't doing something you don't want to do before pressing enter :-)

Note that these commands start with two exclamation marks to denote them as administration commands.

!!reclaim

Quietly fails an item. No attempt is made to notify whatever pipeline is running it.

Use this if an item is stuck. If you use this on an item that is actually still running, the item will simply be completed twice, so no harm done.

!!dripfeed <STASH> <CONCURRENCY>

Sets dripfeed behaviour for a stash. mnbot will periodically move items from the stash into the queue, in regular dequeuing order. No more than CONCURRENCY items from the stash will be in the queue at once.

The exact rate at which mnbot dripfeeds items is undefined, but the concurrency limit will always be respected.

Set CONCURRENCY to 0 to disable dripfeeding. Items that have already been dripfed (even if not yet claimed) will not be affected.

Neither stashes nor dripfeeding is currently functional, so this command has no effect yet.
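
To make the intended concurrency rule concrete, here is a small sketch of a single dripfeed step. It is purely illustrative and not taken from mnbot: dripfeedStep is a made-up function, the stash is assumed to already be sorted in dequeuing order, and "in the queue" is assumed to mean every dripfed item that has not finished yet.

```javascript
// Hypothetical single dripfeed step. `stash` is an array already sorted in
// dequeuing order; `inQueue` holds the stash's items currently in the queue.
function dripfeedStep(stash, inQueue, concurrency) {
  // With concurrency 0 (dripfeeding disabled) there are never free slots,
  // and items already in the queue are left untouched.
  const freeSlots = Math.max(0, concurrency - inQueue.length);
  const moved = stash.splice(0, freeSlots); // removes the items from the stash
  inQueue.push(...moved);
  return moved;
}

const stash = ["item1", "item2", "item3", "item4", "item5"];
const inQueue = ["item0"];
const moved = dripfeedStep(stash, inQueue, 3);
console.log(moved);   // [ 'item1', 'item2' ]
console.log(inQueue); // [ 'item0', 'item1', 'item2' ]
```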

Item viewer

A useful dashboard is WIP.

To view item details, including item results, use https://mnbot.very-good-quality-co.de/item/$ITEM (e.g. https://mnbot.very-good-quality-co.de/item/d2d824fc-744f-483a-8176-670c5cc63c9e). This URL is also included in the status message. To get JSON output, send the request header `Accept: application/json`. The JSON schema isn't currently stable and may change.
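
For scripted access, a request for the JSON form could look roughly like this. The function names are illustrative, and because the schema is not stable the sketch returns the parsed body without assuming any particular fields. It relies on the global fetch available in browsers and Node.js 18+.

```javascript
// Sketch of requesting an item's JSON representation from the item viewer.
const ITEM_BASE_URL = "https://mnbot.very-good-quality-co.de/item/";

// Builds the URL and headers; kept separate so it can be inspected offline.
function buildItemRequest(itemId) {
  return {
    url: ITEM_BASE_URL + encodeURIComponent(itemId),
    headers: { Accept: "application/json" },
  };
}

// Performs the request and returns the parsed JSON body as-is.
async function fetchItem(itemId) {
  const { url, headers } = buildItemRequest(itemId);
  const response = await fetch(url, { headers });
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} for ${url}`);
  }
  return response.json();
}

// Usage (performs a network request, so it is left commented out):
// fetchItem("d2d824fc-744f-483a-8176-670c5cc63c9e").then(console.log);
```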

Current active pipelines

If a website employs IP reputation checks or geoblocking, try queuing to a specific pipeline. If no pipeline is specified, the tracker may hand the item out to any pipeline.

Be sure to follow any appropriate laws/regulations when using the bot. Content must be legal in New York and whatever region the pipeline is in.

Pipeline ID | Hosting provider | Region   | Notes
racknerd    | Racknerd         | New York |