Wikibot

wikibot
IRC bot to run MediaWiki, DokuWiki and PukiWiki dumps
Status Special case
Archiving status In progress... (manual)
Archiving type other
Project source wikibot
IRC channel #wikibot (on hackint)
Project lead User:DigitalDragon
Data [how to use] wikiteam (all WikiTeam uploads), wikiteam_inbox_1 (recent wikibot-only uploads)

wikibot is an IRC bot that will dump the contents of MediaWiki, DokuWiki, and PukiWiki instances. These dumps are uploaded to the WikiTeam collection at the Internet Archive for preservation.

Details

wikibot runs in the #wikibot (on hackint) IRC channel. Various commands (explained below) are available to interact with the bot. In order to create and manage dumps, you'll need voice (+) or operator (@) permissions. If you don't have those, just ask in the channel and someone with permission will be able to help you. If your request gets missed, you may want to ask in the less noisy #wikiteam (on hackint) or #archiveteam-bs (on hackint) channels instead. A dashboard to view running jobs is available here.

If you're writing automation around wikibot, you can get a list of queued and running jobs at https://wikibot.digitaldragon.dev/api/jobs, get information about a specific job at https://wikibot.digitaldragon.dev/api/jobs/JOB_ID, get a websocket of updates to jobs at wss://wikibot.digitaldragon.dev/api/jobevents, and get a firehose of job logs at wss://wikibot.digitaldragon.dev/api/logfirehose. Please note that these APIs are not stable and will likely change in the future.
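
If you just need a quick look at the data, something along the lines of the sketch below will list jobs and follow live job updates. This is only an illustration: it assumes the endpoints return JSON and that the jobevents websocket delivers text frames, it uses the third-party requests and websockets Python packages, and JOB_ID is a placeholder.

  # Rough sketch; assumptions as described above.
  import asyncio
  import requests
  import websockets

  API = "https://wikibot.digitaldragon.dev/api"

  # List queued and running jobs (assumed to be a JSON document).
  print(requests.get(f"{API}/jobs", timeout=30).json())

  # Look up a single job (replace JOB_ID with a real job ID).
  # print(requests.get(f"{API}/jobs/JOB_ID", timeout=30).json())

  async def follow_job_events():
      # Stream updates about jobs as they are queued, started, and finished.
      async with websockets.connect("wss://wikibot.digitaldragon.dev/api/jobevents") as ws:
          async for message in ws:
              print(message)

  asyncio.run(follow_job_events())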

Commands

  • !help shows a help message.
  • !status [job ID] will show the status of a specific job, or a summary of all jobs currently running.
  • !abort <job ID> will stop a job that is currently running.
  • !reupload <job ID> can be used to retry a failed upload to the Internet Archive.
  • !check <search> will generate an Internet Archive search link for a provided domain name, so you can check whether the wiki has already been downloaded. The bot also runs this check itself when a job is submitted, and the --force option (see below) is required to download a wiki that has been dumped within the last year.
  • !bulk <url> will run all of the commands in the linked text file; an example file is shown after this list. Please note that !bulk only supports !mw, !dw, and !pw for now. Jobs will run with --silent-mode fail unless otherwise specified; avoid running large lists of jobs with --silent-mode all, as this will flood the channel with messages about each job starting. https://transfer.archivete.am is the preferred place to upload the file.
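
An example of what a bulk file might look like, assuming one command per line (all of the wiki URLs here are placeholders):

  !mw --url https://wiki.example.com/ --xmlrevisions --images
  !mw --url https://wiki.example.org/ --xmlapiexport --images --delay 2
  !dw --url https://doku.example.net/ --auto
  !pw --url https://puki.example.jp/ --auto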

Queue Management

The bot has multiple different queues that jobs can go into. Queues each have a set concurrency (the maximum number of jobs in the queue that can run at once) and priority (a queue with a higher priority value will start jobs before a queue that has a lower priority value). If you're planning to run many wikis at once, especially if they're on the same wiki farm, please contact an operator to set up a special queue to avoid the bot getting banned or overwhelming the site.

  • !getqueue <queue> checks the concurrency and priority level of a given queue.
  • !setqueue <queue> <concurrency> <priority> (ops only) sets the concurrency and priority of a queue.
  • !movejob <job ID> <queue> moves a job into the specified queue.
  • !pause (ops only) stops all users from submitting new jobs.
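
For illustration (the examplefarm queue name and the job ID are made up; default is the queue jobs go into when no --queue is given):

  !getqueue default
  !setqueue examplefarm 2 10
  !movejob a1b2c3d4 examplefarm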

All job types

The following options can be used across all job types.

Option Description
--url The URL of the wiki to archive
--explain Adds an explanation (or note) to the job
--delay The delay between requests in seconds (NOT ms, unlike ArchiveBot)
--insecure Ignore invalid HTTPS certificates and other SSL errors
--force Run the job even if an archive of the wiki from the last 365 days already exists
--resume Job ID of a previous (failed) job to resume from. Useful if a job fails due to a temporary ban or other transient error.
--queue Put this job into a specific queue (if unspecified, jobs go into the default queue)
--silent-mode What notifications to get about a job:
  all will send all messages
  end will skip the "Queued job!" message (useful for !bulk)
  fail will only send a message if the job fails
  silent won't send any messages about the job
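
As an illustration, here is a hypothetical MediaWiki job using only the general options above, followed by a command that resumes it after a failure (the URL and JOB_ID are placeholders):

  !mw --url https://wiki.example.com/ --delay 2 --queue default --silent-mode fail
  !mw --url https://wiki.example.com/ --resume JOB_ID --delay 5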

MediaWiki

Uses wikiteam3 to archive the contents of a MediaWiki wiki.

  • !mw <options> will dump a MediaWiki wiki.
Option Description
--api, -A Direct link to the api.php of the wiki. The bot will try to automatically detect this unless specified. (e.g. https://wiki.example.com/w/api.php)
--index, -N Direct link to the index.php of the wiki. The bot will try to automatically detect this unless specified. (e.g. https://wiki.example.com/w/index.php)
--api_chunksize, -C The number of pages, revisions, etc. to ask for in each API request (default 50; most wikis will ignore values above 50)
--index-check-threshold Skip the index.php check if the estimated likelihood that index.php exists is greater than (>) this value (default: 0.80)
--xml, -x Export an XML dump. If no other dump option is specified, it will use Special:Export to dump page content. It is highly recommended to use --xmlapiexport or --xmlrevisions if possible
--xmlapiexport, -a Use the revisions API to export page XML
--xmlrevisions, -r (Recommended) Use the allrevisions API to export page XML. This is the fastest and most efficient method, but is only supported on wikis using MediaWiki 1.27 or later
--images, -i Include images in the dump. Recommended unless the images are over 500GiB in size as per the wiki's Special:MediaStatistics page
--bypass-cdn-image-compression Bypass lossy CDN image compression used by some wikis (ex. Cloudflare Polish)
--disable-image-verify, -V Don't verify the image size and hash while downloading
--retries How many times to retry each request before the job fails
--hard-retries How many times to retry hard failures on requests (for example, interrupted connections)
--curonly, -n Only download the latest revision of each page. Not compatible with --xmlrevisions
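
Two hypothetical invocations combining the options above (the URLs are placeholders; the second passes api.php and index.php explicitly instead of relying on auto-detection):

  !mw --url https://wiki.example.com/ --xmlrevisions --images
  !mw --url https://wiki.example.com/ --api https://wiki.example.com/w/api.php --index https://wiki.example.com/w/index.php --xmlapiexport --images --delay 1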

Usage Tips

  • KeyError: 'allrevisions' means that the wiki is too old to support --xmlrevisions; try --xmlapiexport or just --xml on its own instead.
  • If you get an error like ERROR: Unsupported wiki. Wiki engines supported are: MediaWiki but you're sure you tried to dump a MediaWiki:
    • Try to find the index.php and api.php paths of the wiki and pass them directly to the bot. Clicking the login link will usually help find the index.php page; the index.php and api.php paths are almost always interchangeable (swap one filename for the other in the URL).
    • Open inspect element and look for a comment at the top about HTTrack. If this appears, the wiki is a static conversion, and ArchiveBot will be needed instead
  • A ChunkedEncodingError usually just means the job needs to be resumed. You may also want to consider a higher --hard-retries if the wiki seems to consistently have this problem (Fandom is an example that often has these issues)
  • If the wiki is in another language, replacing the page name in the URL with Special:Version or the English name of another special page will usually redirect you to the right place.
  • You can use the wiki's Special:Version page to check if it is new enough (MediaWiki 1.27 or later) to use --xmlrevisions.
  • The --delay of a job can be changed while it is running by someone with access to the server wikibot runs on, so ask in the channel if you need to adjust the speed of a job.

DokuWiki

Uses DokuWiki Dumper to archive the contents of a DokuWiki wiki.

  • !dw <options> will dump a DokuWiki wiki.
Parameter Description
--auto Equivalent to --content --media --html with threads=5 and --ignore-action-disabled-edit (--threads can still be overridden)
--ignore-disposition-header-missing Do not check Disposition header, useful for outdated (<2014) DokuWiki versions [default: False]
--threads Number of sub threads to use [default: 1], not recommended to set > 5
--ignore-action-disabled-edit Some sites disable the edit action for anonymous users and for some core pages. This option ignores that error, as well as the "textarea not found" error; you may only get a partial dump. (Only works with --content)
--current-only Download only the latest revision of each page
--retry How many times to retry each request before the job fails
--hard-retry How many times to retry hard failures on requests (for example, interrupted connections)
--content Dump content
--media Dump media
--html Dump HTML
--pdf Dump PDFs on wikis with the PDF export plugin
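
Hypothetical examples (the URL is a placeholder): the first relies on --auto, the second selects the dump targets explicitly:

  !dw --url https://doku.example.net/ --auto
  !dw --url https://doku.example.net/ --content --media --html --threads 3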

PukiWiki

Uses PukiWiki Dumper to archive the contents of a PukiWiki wiki.

  • !pw <options> will dump a PukiWiki wiki.
Parameter Description
--auto Equivalent to --content --media with threads=2 and --current-only (--threads can still be overridden)
--threads Number of sub threads to use [default: 1], not recommended to set > 5
--ignore-action-disabled-edit Some sites disable the edit action for anonymous users and for some core pages. This option ignores that error, as well as the "textarea not found" error; you may only get a partial dump. (Only works with --content)
--trim-php-warnings Trim PHP warnings from responses
--verbose Verbose output
--current-only Download only the latest revision of each page
--retry How many times to retry each request before the job fails
--hard-retry How many times to retry hard failures on requests (for example, interrupted connections)
--content Dump content
--media Dump media
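
Similarly for PukiWiki (the URL is a placeholder): the first command relies on --auto, the second picks its options explicitly:

  !pw --url https://puki.example.jp/ --auto
  !pw --url https://puki.example.jp/ --content --media --current-only --threads 2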