Wikibot

wikibot | IRC bot to run MediaWiki, DokuWiki, and PukiWiki dumps |
---|---|
Status | Special case |
Archiving status | In progress... (manual) |
Archiving type | other |
Project source | wikibot |
IRC channel | #wikibot (on hackint) |
Project lead | User:DigitalDragon |
Data [how to use] | wikiteam (all WikiTeam uploads), wikiteam_inbox_1 (recent wikibot-only uploads) |
wikibot is an IRC bot that will dump the contents of MediaWiki, DokuWiki, and PukiWiki instances. These dumps are uploaded to the WikiTeam collection at the Internet Archive for preservation.
Details
wikibot runs in the #wikibot (on hackint) IRC channel. Various commands (explained below) are available to interact with the bot. In order to create and manage dumps, you'll need voice (+) or operator (@) permissions. If you don't have those, just ask in the channel and someone with permission will be able to help you. If your request gets missed, you may want to ask in the less noisy #wikiteam (on hackint) or #archiveteam-bs (on hackint) channels instead. A dashboard to view running jobs is available here.
If you're writing automation around wikibot, you can get a list of queued and running jobs at https://wikibot.digitaldragon.dev/api/jobs, get information about a specific job at https://wikibot.digitaldragon.dev/api/jobs/JOB_ID, subscribe to a websocket of job updates at wss://wikibot.digitaldragon.dev/api/jobevents, and get a firehose of job logs at wss://wikibot.digitaldragon.dev/api/logfirehose. Please note that these APIs are not stable and will likely change in the future.
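For example, a small script can poll the REST endpoints above. This is a minimal sketch that assumes the endpoints return JSON; the response fields are not documented here and, as noted, the API may change.

```python
# Minimal sketch: poll the wikibot job APIs listed above.
# Assumes the endpoints return JSON; field names are undocumented and may change.
import json
import urllib.request

BASE = "https://wikibot.digitaldragon.dev/api"

def list_jobs():
    """Return the list of queued and running jobs."""
    with urllib.request.urlopen(f"{BASE}/jobs", timeout=30) as resp:
        return json.load(resp)

def get_job(job_id):
    """Return details for a single job by its ID."""
    with urllib.request.urlopen(f"{BASE}/jobs/{job_id}", timeout=30) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(json.dumps(list_jobs(), indent=2))
```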
Commands
- !help shows a help message.
- !status [job ID] will show the status of a specific job, or a summary of all jobs currently running.
- !abort <job ID> will stop a job that is currently running.
- !reupload <job ID> can be used to retry a failed upload to the Internet Archive.
- !check <search> will generate an Internet Archive search link for a provided domain name, to check whether the wiki has already been downloaded. The bot also runs this search automatically, and a special parameter (--force, listed below) is required to download a wiki that has been dumped within the last year.
- !bulk <url> will run all of the commands in the linked text file (see the sample file below). Please note that !bulk only supports !mw, !dw, and !pw for now. Jobs will run with --silent-mode fail unless otherwise specified; avoid running large lists of jobs with --silent-mode all, as this will flood the channel with messages about each job starting. https://transfer.archivete.am is the preferred place to upload files.
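A bulk file is just a plain text file with one command per line, uploaded somewhere like https://transfer.archivete.am and then passed to !bulk as a link. A hypothetical example (all URLs are placeholders; these jobs would run with --silent-mode fail unless specified otherwise):

```
!mw --url https://wiki.example.com --xml --xmlrevisions --images --delay 1
!dw --url https://doku.example.org --auto
!pw --url https://puki.example.net --auto
```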
Queue Management
The bot has multiple different queues that jobs can go into. Queues each have a set concurrency (the maximum number of jobs in the queue that can run at once) and priority (a queue with a higher priority value will start jobs before a queue that has a lower priority value). If you're planning to run many wikis at once, especially if they're on the same wiki farm, please contact an operator to set up a special queue to avoid the bot getting banned or overwhelming the site.
- !getqueue <queue> checks the concurrency and priority level of a given queue.
- !setqueue <queue> <concurrency> <priority> (ops only) sets the concurrency and priority of a queue.
- !movejob <job ID> <queue> moves a job into the specified queue.
- !pause (ops only) stops all users from submitting new jobs.
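A hypothetical example (the queue name and JOB_ID are placeholders): an operator sets a queue named fandomqueue to run at most two jobs at once at priority 10, someone checks its settings, and a job is then moved into it.

```
!setqueue fandomqueue 2 10
!getqueue fandomqueue
!movejob JOB_ID fandomqueue
```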
All job types
The following options can be used across all job types.
Option | Description |
---|---|
--url | The URL of the wiki to archive |
--explain | Adds an explanation (or note) to the job |
--delay | The delay between requests in seconds (NOT ms, unlike ArchiveBot) |
--insecure | Ignore invalid HTTPS certificates and other SSL errors |
--force | Run the job even if an archive of the wiki from the last 365 days already exists |
--resume | Job ID of a previous (failed) job to resume from. Useful if a job fails due to a temporary ban or other transient error. |
--queue | Put this job into a specific queue (if unspecified, jobs go into the default queue) |
--silent-mode | What notifications to send about a job: all sends all messages; end skips the "Queued job!" message (useful for !bulk); fail only sends a message if the job fails; silent sends no messages about the job |
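These options combine with any of the job commands in the sections below. A hypothetical example (the URL is a placeholder) that queues a MediaWiki dump with a one-second delay between requests and only reports back if the job fails:

```
!mw --url https://wiki.example.com --delay 1 --silent-mode fail
```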
MediaWiki
Uses wikiteam3 to archive the contents of a MediaWiki wiki.
!mw <options> will dump a MediaWiki wiki.
Option | Description |
---|---|
--api, -A | Direct link to the api.php of the wiki. The bot will try to automatically detect this unless specified. (e.g. https://wiki.example.com/w/api.php) |
--index, -N | Direct link to the index.php of the wiki. The bot will try to automatically detect this unless specified. (e.g. https://wiki.example.com/w/index.php) |
--api_chunksize, -C | The number of pages, revisions, etc. to ask for in each API request (default 50; most wikis will ignore values above 50) |
--index-check-threshold | Skip the index.php check if the likelihood that index.php exists is greater than this value (default: 0.80) |
--xml, -x | Export an XML dump. If no other dump option is specified, it will use Special:Export to dump page content. It is highly recommended to use --xmlapiexport or --xmlrevisions if possible |
--xmlapiexport, -a | Use the revisions API to export page XML |
--xmlrevisions, -r | (Recommended) Use the allrevisions API to export page XML. This is the fastest and most efficient method, but is only supported on wikis using MediaWiki 1.27 or later |
--images, -i | Include images in the dump. Recommended unless the images are over 500 GiB in size as per the wiki's Special:MediaStatistics page |
--bypass-cdn-image-compression | Bypass lossy CDN image compression used by some wikis (e.g. Cloudflare Polish) |
--disable-image-verify, -V | Don't verify the image size and hash while downloading |
--retries | How many times to retry each request before the job fails |
--hard-retries | How many times to retry hard failures on requests (for example, interrupted connections) |
--curonly, -n | Only download the latest revision of each page. Not compatible with --xmlrevisions |
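Putting the recommendations in the table together, a typical invocation for a reasonably modern wiki might look like the sketch below (the URL is a placeholder; passing --xml alongside --xmlrevisions is an assumption here). On older wikis, fall back to --xmlapiexport or plain --xml, as described in the tips below.

```
!mw --url https://wiki.example.com --xml --xmlrevisions --images --delay 1
```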
Usage Tips
- KeyError: 'allrevisions' means that the wiki is too old to support --xmlrevisions; try --xmlapiexport or just --xml on its own instead.
- If you get an error like ERROR: Unsupported wiki. Wiki engines supported are: MediaWiki but you're sure you tried to dump a MediaWiki:
  - Try to find the index.php and api.php paths of the wiki and pass them directly to the bot. Clicking the login link will usually help find the index.php page. The index.php and api.php paths are almost always interchangeable.
  - Open inspect element and look for a comment at the top about HTTrack. If this appears, the wiki is a static conversion, and ArchiveBot will be needed instead.
- A ChunkedEncodingError usually just means the job needs to be resumed. You may also want to consider a higher --hard-retries value if the wiki seems to consistently have this problem (Fandom is an example that often has these issues).
- If the wiki is in another language, replacing the page name in the URL with Special:Version or the English name of another special page will usually redirect you to the right place.
- You can use the wiki's Special:Version page to check if it is new enough to use --xmlrevisions.
- The --delay of a job can be changed while it is running by someone with access to the server wikibot runs on, so ask in the channel if you need to adjust the speed of a job.
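Several of the tips above amount to "find api.php and check the MediaWiki version before dumping". A minimal sketch of doing that with the standard MediaWiki siteinfo query (the URL is a placeholder):

```python
# Minimal sketch: confirm an api.php path works and read the MediaWiki version,
# e.g. to decide whether --xmlrevisions (MediaWiki 1.27+) is worth trying.
import json
import urllib.request

API = "https://wiki.example.com/w/api.php"  # placeholder; pass to the bot with --api
QUERY = "?action=query&meta=siteinfo&siprop=general&format=json"

req = urllib.request.Request(API + QUERY, headers={"User-Agent": "wiki-version-check/0.1"})
with urllib.request.urlopen(req, timeout=30) as resp:
    general = json.load(resp)["query"]["general"]

print(general.get("generator"))  # e.g. "MediaWiki 1.39.3"
print(general.get("script"))     # index.php path, e.g. "/w/index.php"
```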
DokuWiki
Uses DokuWiki Dumper to archive the contents of a DokuWiki wiki.
!dw <options> will dump a DokuWiki wiki.
Parameter | Description |
---|---|
--auto | Dump: content+media+html, threads=5, ignore-action-disabled-edit (threads is overridable) |
--ignore-disposition-header-missing | Do not check the Disposition header; useful for outdated (<2014) DokuWiki versions [default: False] |
--threads | Number of sub threads to use [default: 1]; not recommended to set > 5 |
--ignore-action-disabled-edit | Some sites disable the edit action for anonymous users and some core pages. This option will ignore this error, as well as the "textarea not found" error. You may only get a partial dump. (only works with --content) |
--current-only | Download only the latest revision of each page |
--retry | How many times to retry each request before the job fails |
--hard-retry | How many times to retry hard failures on requests (for example, interrupted connections) |
--content | Dump content |
--media | Dump media |
--html | Dump HTML |
--pdf | Dump PDFs on wikis with the PDF export plugin |
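A hypothetical example (the URL is a placeholder): the first line uses the --auto preset, while the second spells out roughly the same dump with a lower thread count.

```
!dw --url https://doku.example.org --auto
!dw --url https://doku.example.org --content --media --html --ignore-action-disabled-edit --threads 2
```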
PukiWiki
Uses PukiWiki Dumper to archive the contents of a PukiWiki wiki.
!pw <options> will dump a PukiWiki wiki.
Parameter | Description |
---|---|
--auto | Dump: content+media, threads=2, current-only. (threads is overridable) |
--threads | Number of sub threads to use [default: 1], not recommended to set > 5 |
--ignore-action-disabled-edit | Some sites disable the edit action for anonymous users and some core pages. This option will ignore this error, as well as the "textarea not found" error. You may only get a partial dump. (only works with --content) |
--trim-php-warnings | Trim PHP warnings from responses |
--verbose | Verbose output |
--current-only | Download only the latest revision of each page |
--retry | How many times to retry each request before the job fails |
--hard-retry | How many times to retry hard failures on requests (for example, interrupted connections) |
--content | Dump content |
--media | Dump media |
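A hypothetical example (the URL is a placeholder): the first line uses the --auto preset, while the second spells out roughly the same dump explicitly.

```
!pw --url https://puki.example.net --auto
!pw --url https://puki.example.net --content --media --current-only --threads 2
```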