Wikibot

wikibot
IRC bot to run MediaWiki, DokuWiki and PukiWiki dumps
Status Special case
Archiving status In progress... (manual)
Archiving type other
Project source wikibot
IRC channel #wikibot (on hackint)
Project lead User:DigitalDragon
Data [how to use] wikiteam (all WikiTeam uploads), wikiteam_inbox_1 (recent wikibot-only uploads)

wikibot is an IRC bot that will dump the contents of MediaWiki, DokuWiki, and PukiWiki instances. These dumps are uploaded to the WikiTeam collection at the Internet Archive for preservation.

Details

wikibot runs in the #wikibot (on hackint) IRC channel. Various commands (explained below) are available to interact with the bot. In order to create and manage dumps, you'll need voice (+) or operator (@) permissions. If you don't have those, just ask in the channel and someone with permission will be able to help you. If your request gets missed, you may want to ask in the less noisy #wikiteam (on hackint) or #archiveteam-bs (on hackint) channels instead. A dashboard to view running jobs is available here.

If you're writing automation around wikibot, you can get a list of queued and running jobs at https://wikibot.digitaldragon.dev/api/jobs, get information about a specific job at https://wikibot.digitaldragon.dev/api/jobs/JOB_ID, get a websocket of updates to jobs at wss://wikibot.digitaldragon.dev/api/jobevents, and get a firehose of job logs at wss://wikibot.digitaldragon.dev/api/logfirehose. Please note that these APIs are not stable and will likely change in the future.
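
If you just need a quick look at the data, something along the lines of the sketch below will list jobs and follow live job updates. This is only an illustration: it assumes the endpoints return JSON and that the jobevents websocket delivers text frames, it uses the third-party requests and websockets Python packages, and JOB_ID is a placeholder.

  # Rough sketch; assumptions as described above.
  import asyncio
  import requests
  import websockets

  API = "https://wikibot.digitaldragon.dev/api"

  # List queued and running jobs (assumed to be a JSON document).
  print(requests.get(f"{API}/jobs", timeout=30).json())

  # Look up a single job (replace JOB_ID with a real job ID).
  # print(requests.get(f"{API}/jobs/JOB_ID", timeout=30).json())

  async def follow_job_events():
      # Stream updates about jobs as they are queued, started, and finished.
      async with websockets.connect("wss://wikibot.digitaldragon.dev/api/jobevents") as ws:
          async for message in ws:
              print(message)

  asyncio.run(follow_job_events())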

Commands

  • !help shows a help message.
  • !status [job ID] will show the status of a specific job, or a summary of all jobs currently running.
  • !abort <job ID> will stop a job that is currently running.
  • !reupload <job ID> can be used to retry a failed upload to the Internet Archive.
  • !check <search> will generate an Internet Archive search link for a provided domain name, so you can check whether the wiki has already been downloaded. The bot also runs this check itself when a job is submitted, and the --force option (see below) is required to download a wiki that has been dumped within the last year.
  • !bulk <url> will run all of the commands in the linked text file; an example file is shown after this list. Please note that !bulk only supports !mw, !dw, and !pw for now. Jobs will run with --silent-mode fail unless otherwise specified; avoid running large lists of jobs with --silent-mode all, as this will flood the channel with messages about each job starting. https://transfer.archivete.am is the preferred place to upload the file.
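
An example of what a bulk file might look like, assuming one command per line (all of the wiki URLs here are placeholders):

  !mw --url https://wiki.example.com/ --xmlrevisions --images
  !mw --url https://wiki.example.org/ --xmlapiexport --images --delay 2
  !dw --url https://doku.example.net/ --auto
  !pw --url https://puki.example.jp/ --auto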

Queue Management

The bot has multiple different queues that jobs can go into. Queues each have a set concurrency (the maximum number of jobs in the queue that can run at once) and priority (a queue with a higher priority value will start jobs before a queue that has a lower priority value). If you're planning to run many wikis at once, especially if they're on the same wiki farm, please contact an operator to set up a special queue to avoid the bot getting banned or overwhelming the site.

  • !getqueue <queue> checks the concurrency and priority level of a given queue.
  • !setqueue <queue> <concurrency> <priority> (ops only) sets the concurrency and priority of a queue.
  • !movejob <job ID> <queue> moves a job into the specified queue.
  • !pause (ops only) stops all users from submitting new jobs.
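
For illustration (the examplefarm queue name and the job ID are made up; default is the queue jobs go into when no --queue is given):

  !getqueue default
  !setqueue examplefarm 2 10
  !movejob a1b2c3d4 examplefarm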

All job types

The following options can be used across all job types.

Option Description
--url The URL of the wiki to archive
--explain Adds an explanation (or note) to the job
--delay The delay between requests in seconds (NOT ms, unlike ArchiveBot)
--insecure Ignore invalid HTTPS certificates and other SSL errors
--force Run the job even if an archive of the wiki from the last 365 days already exists
--resume Job ID of a previous (failed) job to resume from. Useful if a job fails due to a temporary ban or other transient error.
--queue Put this job into a specific queue (if unspecified, jobs go into the default queue)
--silent-mode What notifications to get about a job:
  all will send all messages
  end will skip the "Queued job!" message (useful for !bulk)
  fail will only send a message if the job fails
  silent won't send any messages about the job
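
As an illustration, here is a hypothetical MediaWiki job using only the general options above, followed by a command that resumes it after a failure (the URL and JOB_ID are placeholders):

  !mw --url https://wiki.example.com/ --delay 2 --queue default --silent-mode fail
  !mw --url https://wiki.example.com/ --resume JOB_ID --delay 5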

MediaWiki

Uses wikiteam3 to archive the contents of a MediaWiki wiki.

  • !mw <options> will dump a MediaWiki wiki.
Option Description
--api, -A Direct link to the api.php of the wiki. The bot will try to automatically detect this unless specified. (e.g. https://wiki.example.com/w/api.php)
--index, -N Direct link to the index.php of the wiki. The bot will try to automatically detect this unless specified. (e.g. https://wiki.example.com/w/index.php)
--api_chunksize, -C The number of pages, revisions, etc. to ask for in each API request (default 50; most wikis will ignore values above 50)
--index-check-threshold Skip the index.php check if the estimated likelihood that index.php exists is greater than (>) this value (default: 0.80)
--xml, -x Export an XML dump. If no other dump option is specified, it will use Special:Export to dump page content. It is highly recommended to use --xmlapiexport or --xmlrevisions if possible
--xmlapiexport, -a Use the revisions API to export page XML
--xmlrevisions, -r (Recommended) Use the allrevisions API to export page XML. This is the fastest and most efficient method, but is only supported on wikis using MediaWiki 1.27 or later
--images, -i Include images in the dump. Recommended unless the images are over 500GiB in size as per the wiki's Special:MediaStatistics page
--bypass-cdn-image-compression Bypass lossy CDN image compression used by some wikis (ex. Cloudflare Polish)
--disable-image-verify, -V Don't verify the image size and hash while downloading
--retries How many times to retry each request before the job fails
--hard-retries How many times to retry hard failures on requests (for example, interrupted connections)
--curonly, -n Only download the latest revision of each page. Not compatible with --xmlrevisions
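
Two hypothetical invocations combining the options above (the URLs are placeholders; the second passes api.php and index.php explicitly instead of relying on auto-detection):

  !mw --url https://wiki.example.com/ --xmlrevisions --images
  !mw --url https://wiki.example.com/ --api https://wiki.example.com/w/api.php --index https://wiki.example.com/w/index.php --xmlapiexport --images --delay 1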

Usage Tips

  • KeyError: 'allrevisions' means that the wiki is too old to support --xmlrevisions; try --xmlapiexport or just --xml on its own instead.
  • If you get an error like ERROR: Unsupported wiki. Wiki engines supported are: MediaWiki but you're sure you tried to dump a MediaWiki:
    • Try to find the index.php and api.php paths of the wiki and pass them directly to the bot. Clicking the login link will usually help find the index.php page; the index.php and api.php paths are almost always interchangeable (swap one filename for the other in the URL).
    • Open inspect element and look for a comment at the top about HTTrack. If this appears, the wiki is a static conversion, and ArchiveBot will be needed instead
  • A ChunkedEncodingError usually just means the job needs to be resumed. You may also want to consider a higher --hard-retries if the wiki seems to consistently have this problem (Fandom is an example that often has these issues)
  • If the wiki is in another language, replacing the page name in the URL with Special:Version or the English name of another special page will usually redirect you to the right place.
  • You can use the wiki's Special:Version page to check if it is new enough (MediaWiki 1.27 or later) to use --xmlrevisions.
  • The --delay of a job can be changed while it is running by someone with access to the server wikibot runs on, so ask in the channel if you need to adjust the speed of a job.

DokuWiki

Uses DokuWiki Dumper to archive the contents of a DokuWiki wiki.

  • !dw <options> will dump a DokuWiki wiki.
Parameter Description
--auto Equivalent to --content --media --html with threads=5 and --ignore-action-disabled-edit (--threads can still be overridden)
--ignore-disposition-header-missing Do not check Disposition header, useful for outdated (<2014) DokuWiki versions [default: False]
--threads Number of sub threads to use [default: 1], not recommended to set > 5
--ignore-action-disabled-edit Some sites disable the edit action for anonymous users and for some core pages. This option ignores that error, as well as the "textarea not found" error; you may only get a partial dump. (Only works with --content)
--current-only Download only the latest revision of each page
--retry How many times to retry each request before the job fails
--hard-retry How many times to retry hard failures on requests (for example, interrupted connections)
--content Dump content
--media Dump media
--html Dump HTML
--pdf Dump PDFs on wikis with the PDF export plugin
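
Hypothetical examples (the URL is a placeholder): the first relies on --auto, the second selects the dump targets explicitly:

  !dw --url https://doku.example.net/ --auto
  !dw --url https://doku.example.net/ --content --media --html --threads 3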

PukiWiki

Uses PukiWiki Dumper to archive the contents of a PukiWiki wiki.

  • !pw <options> will dump a PukiWiki wiki.
Parameter Description
--auto Equivalent to --content --media with threads=2 and --current-only (--threads can still be overridden)
--threads Number of sub threads to use [default: 1], not recommended to set > 5
--ignore-action-disabled-edit Some sites disable the edit action for anonymous users and for some core pages. This option ignores that error, as well as the "textarea not found" error; you may only get a partial dump. (Only works with --content)
--trim-php-warnings Trim PHP warnings from responses
--verbose Verbose output
--current-only Download only the latest revision of each page
--retry How many times to retry each request before the job fails
--hard-retry How many times to retry hard failures on requests (for example, interrupted connections)
--content Dump content
--media Dump media
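
Similarly for PukiWiki (the URL is a placeholder): the first command relies on --auto, the second picks its options explicitly:

  !pw --url https://puki.example.jp/ --auto
  !pw --url https://puki.example.jp/ --content --media --current-only --threads 2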