User:TheTechRobo/Mnbot
Documentation is a work-in-progress.
Background information
Pipelines
mnbot is designed to be scalable. More pipelines can easily be added, and while I haven't yet tested it, a single pipeline can be configured to process multiple items in parallel.
Assuming Docker is already installed, a pipeline requires roughly 1.5 GB of disk space to build all the containers. As for temporary space, I don't yet have a good grasp of how much it needs, but more is obviously better :-)
The Queue
Items are dequeued by priority (lower priority values run first). Items with the same priority are dequeued in FIFO order.
Effective priority (what is actually used for computing order) is defined as priority + tries. Jobs currently get tried up to 3 times before failing.
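The ordering rule above can be sketched roughly as follows. This is only an illustration, not mnbot's actual code: the Item class, the seq arrival counter, and the dequeue_order helper are hypothetical names, and the sketch assumes that ties in effective priority are broken by arrival order.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical arrival counter, used only to model FIFO ordering.
_arrival = count()


@dataclass
class Item:
    url: str
    priority: int
    tries: int = 0
    # Assumption: each item remembers the order in which it was queued.
    seq: int = field(default_factory=lambda: next(_arrival))


def effective_priority(item: Item) -> int:
    # From the text: effective priority is priority + tries.
    return item.priority + item.tries


def dequeue_order(items: list[Item]) -> list[Item]:
    # Lower effective priority first; earlier arrivals first on ties.
    return sorted(items, key=lambda i: (effective_priority(i), i.seq))
```

Under these assumptions, an item that has already been tried twice sorts behind a fresh item with the same base priority.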
Using the bot
!brozzle, !b
Brozzles a page. Recursion is not yet supported.
Requires one of +v or +o. +v is provided to all users registered with NickServ.
Possible options:
- --custom-js: Operators (+o) only. URL to a custom JavaScript file to run during the crawl. See the Custom JavaScript section below.
- --explanation/-e: Saves an explanation in the database for the crawl. This can be changed later with the !explain command.
- --user-agent/-u: Sets user agent header behaviour. Can take one of the following values:
  - default: Acts like Chrome on Windows, and appends (mnbot VERSION; +https://wiki.archiveteam.org/index.php/User:TheTechRobo/Chromebot).
  - stealth: Acts like Chrome on Windows, but doesn't add information about mnbot.
  - minimal: Sets the user agent to mnbot VERSION (+https://wiki.archiveteam.org/index.php/User:TheTechRobo/Chromebot)
  - curl: Sets the user agent to that of Curl.
  - archivebot: Sets the user agent to one like ArchiveBot.
  - googlebot: Sets the user agent to one like Googlebot Desktop.
  - googlebot1: An older and outdated variation of Googlebot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
!explain, !e
Sets the explanation for an item.
Usage: !explain <IDENT> [EXPLANATION...]
!status
Forms: !status, !status [IDENTS...]
If IDENTS is provided, returns the status of each ident. Otherwise, returns queue status.
!limbo
Gets items in limbo (claimed for more than a day). If an item is here, it's likely the pipeline crashed, and you can probably use !!reclaim to reclaim the item (see Administration commands below).
!whereis, !w
Retrieves the last pipeline that claimed a certain job.
Custom JavaScript
Custom JavaScript behaviour can be added by using the --custom-js argument to !b.
The script is run after the screenshot is taken, but before brozzler is instructed to search for outlinks or visit anchors. The return value of the script (the value of its last line) is saved in the WARC and shown on the item viewer webpage (see below), which is useful for generating a list of URLs for ArchiveBot, for example. If you want to use the return value, it must be JSON-encodable or you will run into issues. The file must begin with //! mnbot v1 followed by a newline.
The script is run in REPL mode, which means top-level await and let redeclaration are both allowed. In effect, it runs as if you had typed it into the browser console.
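Before hosting a script, it may help to check the header line locally. The sketch below is a hypothetical helper, not part of mnbot: it assumes the newline after the header is a plain "\n", and the embedded JavaScript is only an illustrative script whose last line evaluates to a JSON-encodable list of link URLs.

```python
# Header required by the text above; the "\n" line ending is an assumption.
REQUIRED_HEADER = "//! mnbot v1\n"


def has_mnbot_header(script_text: str) -> bool:
    """Return True if the script starts with the required header line."""
    return script_text.startswith(REQUIRED_HEADER)


# Illustrative custom script (hypothetical): the last line is the return value.
EXAMPLE_SCRIPT = (
    "//! mnbot v1\n"
    "// Collect every link on the page as a list of strings.\n"
    "let links = Array.from(document.querySelectorAll('a[href]'), a => a.href);\n"
    "links\n"
)

if __name__ == "__main__":
    print(has_mnbot_header(EXAMPLE_SCRIPT))
```

A script that fails this check is missing the header described above; how mnbot itself reacts to a missing header is not covered here.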
Result format
JSON object:
- status: success if mnbot believes it to be successful, exception if an exception occurred, unknown if unknown
  - If the status is unknown, the only other field will be fullResult, which contains the full result as sent from Chrome
- exceptionDetails: if an exception occurred, a CDP ExceptionDetails object; otherwise, null
- remoteObject: a CDP RemoteObject of the script's return value
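As a rough illustration of consuming this format, the sketch below branches on the status field. summarize_result is a hypothetical helper; it assumes the CDP ExceptionDetails object carries its usual text field and that the RemoteObject includes a value field, which CDP only provides for primitive or by-value results.

```python
import json


def summarize_result(raw: str) -> str:
    """Turn a custom-JS result (as a JSON string) into a one-line summary."""
    result = json.loads(raw)
    status = result.get("status")
    if status == "unknown":
        # Per the text, fullResult is the only other field in this case.
        return f"unknown: {result.get('fullResult')!r}"
    if status == "exception":
        details = result.get("exceptionDetails") or {}
        # Assumption: "text" is present, as in CDP's ExceptionDetails type.
        return f"exception: {details.get('text', '(no text)')}"
    if status == "success":
        remote = result.get("remoteObject") or {}
        # Assumption: "value" is present; it may be absent for some objects.
        return f"success: {remote.get('value')!r}"
    return f"unexpected status: {status!r}"
```

Since the JSON schema is not stable, a consumer like this should tolerate missing or extra fields, which is why every lookup uses get.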
Administration commands
These commands require +o to use. Some of them are dangerous! (Not so much right now, but maybe in the future.) Please make sure you aren't doing something you don't want to do before pressing enter :-)
Note that these commands start with two exclamation marks to denote them as administration commands.
!!reclaim
Quietly fails an item. No attempt is made to notify whatever pipeline is running it.
Use this if an item is stuck. If you use this on an item that is actually still running, the item will simply be completed twice, so no harm done.
!!dripfeed <STASH> <CONCURRENCY>
Sets dripfeed behaviour for a stash. mnbot will periodically move items from the stash into the queue, in regular dequeuing order. No more than CONCURRENCY items from the stash will be in the queue at once.
The exact rate at which mnbot dripfeeds items is undefined, but the concurrency limit will always be respected.
Set CONCURRENCY to 0 to disable dripfeeding. Items that have already been dripfed (even if not yet claimed) will not be affected.
Neither stashes nor dripfeeding are functional yet, so this command currently has no effect.
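The concurrency rule described above could look something like the following. This is purely a sketch of the documented semantics, not the real implementation: dripfeed_step is a hypothetical function, the stash is modelled as a list already sorted in dequeuing order, and queued_from_stash is assumed to count the stash's items currently in the queue.

```python
def dripfeed_step(stash: list, queued_from_stash: int, concurrency: int) -> list:
    """Remove and return the stash items that may be moved into the queue now."""
    if concurrency <= 0:
        # CONCURRENCY of 0 disables dripfeeding; queued items are untouched.
        return []
    # Never let more than `concurrency` stash items sit in the queue at once.
    free_slots = max(0, concurrency - queued_from_stash)
    moved = stash[:free_slots]
    del stash[:free_slots]
    return moved
```

How often such a step runs is left open here, matching the text: the rate is undefined, and only the concurrency limit is guaranteed.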
Item viewer
A useful dashboard is WIP.
For viewing item details, including item results, use https://mnbot.very-good-quality-co.de/item/$ITEM (e.g. https://mnbot.very-good-quality-co.de/item/d2d824fc-744f-483a-8176-670c5cc63c9e). This link is also included in the status message. For JSON output, send the request header `Accept: application/json`. The JSON schema isn't currently stable and may change.
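A minimal sketch of requesting the JSON view from a script, using only the Python standard library. build_item_request and fetch_item are hypothetical helper names, and because the schema is unstable, the sketch makes no assumptions about the fields in the response.

```python
import json
import urllib.request

BASE_URL = "https://mnbot.very-good-quality-co.de/item/"


def build_item_request(item_id: str) -> urllib.request.Request:
    """Build a request for an item that asks for JSON instead of HTML."""
    return urllib.request.Request(
        BASE_URL + item_id,
        headers={"Accept": "application/json"},
    )


def fetch_item(item_id: str):
    """Fetch and decode the item JSON (performs a network request)."""
    with urllib.request.urlopen(build_item_request(item_id), timeout=30) as resp:
        return json.load(resp)
```

For example, fetch_item("d2d824fc-744f-483a-8176-670c5cc63c9e") would request the example item above.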
Current active pipelines
If a website employs IP reputation checks or geoblocking, try queuing to a specific pipeline. If no pipeline is specified, the tracker may hand the item out to any pipeline.
Be sure to follow any appropriate laws/regulations when using the bot. Content must be legal in New York and whatever region the pipeline is in.
Pipeline ID | Hosting provider | Region | Notes |
---|---|---|---|
racknerd | Racknerd | New York | |