ArchiveBot/Bot documentation

This page documents how the User:HadeanEon ArchiveBot wiki bot works.

Introduction

The bot takes a list of URLs and generates a table from it that lists the relevant ArchiveBot jobs for each URL. This makes it possible to collect resources related to a particular topic and keep an overview of which ones have been archived.

Basics

The bot needs two pages. One is the page where the table is shown ("tracking page"), and the other contains the URL list ("list page"). The list page has the same name as the tracking page with /list appended.

Create the page "ArchiveBot/Example" with this text:

Optional: an introduction to what this page is about, if it's not obvious from the title.

<!-- bot --><!-- /bot -->

{{archivebot}}

And the page "ArchiveBot/Example/list" with a plain list of URLs:

https://an.example.org/
https://another.example.net/

Once the bot processes the page, "ArchiveBot/Example" will look something like this:

  • Statistics: Saved! (0) · Not saved yet (2) · Total size (0 KiB)

Do not edit this table, it is automatically updated by bot. There is a raw list of URLs that you can edit.

Details

The bot replaces everything between <!-- bot --> and <!-- /bot --> with its output. Editing anything between these markers is pointless because it is overwritten the next time the bot runs, but you can freely edit the text before and after the markers.
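As an illustration only, the replacement step could be sketched in Python as follows. This is not the bot's actual code; the function name replace_bot_output and the choice to put the output on its own lines are assumptions made for the example.

```python
import re

# Matches the opening marker, whatever currently sits between the markers,
# and the closing marker. DOTALL lets the old output span several lines.
BOT_BLOCK = re.compile(r"(<!-- bot -->).*?(<!-- /bot -->)", re.DOTALL)


def replace_bot_output(page_text, new_output):
    """Swap the text between each marker pair for new_output, keeping the markers.

    Hypothetical helper: the real bot may format its output differently.
    """
    return BOT_BLOCK.sub(
        lambda m: m.group(1) + "\n" + new_output + "\n" + m.group(2),
        page_text,
    )


page = "Intro text.\n\n<!-- bot -->old table<!-- /bot -->\n\n{{archivebot}}"
updated = replace_bot_output(page, "new table")
```

Text outside the markers, such as the introduction and {{archivebot}}, passes through unchanged, and running the function again with the same output leaves the page as it is.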

The bot also sorts the list page before generating the table. The sort ignores the protocol as well as www. Some file hosters get special treatment (namely transfer.notkiska.pw, transfer.sh, and ix.io when tricked into using a filename in the URL): the file ID is removed from the URL before sorting, so that these URLs are sorted by filename instead. If two entries have the same post-processed URL, they are sorted by label, then by full URL, then by note, and finally by the full line as entered on the list page. Duplicate lines are removed.
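The sorting rules above can be approximated with a short Python sketch. It is not taken from the bot's source: the names sort_key and sort_list are hypothetical, the file ID is assumed to be the first path segment (host/ID/filename), www. is only stripped at the start of the host, and the label and note tie-breakers are left out for brevity.

```python
import re

# Hosts for which the file ID is dropped before sorting.
FILE_HOSTS = ("transfer.notkiska.pw", "ix.io", "transfer.sh")


def sort_key(url):
    """Return the URL without protocol, leading www., and (on file hosts) file ID."""
    key = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url, flags=re.IGNORECASE)
    if key.startswith("www."):
        key = key[len("www."):]
    host, _, path = key.partition("/")
    if host in FILE_HOSTS:
        # Assumption: the file ID is the first path segment (host/ID/filename).
        _file_id, sep, filename = path.partition("/")
        if sep:
            key = host + "/" + filename
    return key


def sort_list(lines):
    """Drop blank and duplicate lines, then sort by the post-processed URL."""
    unique = {line.strip() for line in lines if line.strip()}
    # Ties fall back to the full line; label/URL/note tie-breakers are omitted.
    return sorted(unique, key=lambda line: (sort_key(line.split("|")[0].strip()), line))


ordered = sort_list([
    "https://b.example/",
    "http://www.a.example/",
    "https://b.example/",
])
```

Here the two identical lines collapse into one, and http://www.a.example/ sorts before https://b.example/ because neither the protocol nor www. takes part in the comparison.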

Labels

You can add a label to a URL, which will be displayed instead of the URL in the table. Note that the entries will still be sorted according to the URL.

Usage (on the list page):

https://example.org/some/really/long/url/that/should/not/appear/in/full/in/the/table.php.html.aspx | label = Example page

would cause the link to be rendered as:

Example page

Sections

If you want to divide the tracking page further to avoid huge, unmanageable tables and lists, you can use sections. Add section headings to the list page (the bot ignores the section level entirely), then refer to them on the tracking page using <!-- bot:Section name -->. The closing tag stays the same: <!-- /bot -->. The plain <!-- bot --> tag refers to everything that appears before the first section on the list page.

Note that the section names refer to the list page's sections; the sections on the tracking page can be titled, ordered, and nested differently and are irrelevant to the bot.

For example, on "ArchiveBot/Section example":

<!-- bot --><!-- /bot -->

== A section ==
<!-- bot:Part two --><!-- /bot -->

=== A subsection ===
<!-- bot:Part one --><!-- /bot -->

{{archivebot}}

And on "ArchiveBot/Section example/list":

http://example.org/

== Part one ==
https://something.example.net/

== Part two ==
https://foo.bar.example.com/
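How the list page above breaks down into sections, and how a tracking page marker maps to one of them, can be sketched in Python. This is an illustrative guess at the logic rather than the bot's implementation; split_sections and marker_section are hypothetical names, and None stands for the unnamed part before the first heading.

```python
import re

# A wikitext heading of any level, e.g. "== Part one ==" or "=== Part one ===".
HEADING = re.compile(r"^(=+)\s*(.+?)\s*\1\s*$")
# An opening marker, optionally naming a list page section.
MARKER = re.compile(r"<!-- bot(?::(.+?))? -->")


def split_sections(list_page_text):
    """Map each section name to its lines; None holds lines before the first section."""
    sections = {None: []}
    current = None
    for line in list_page_text.splitlines():
        match = HEADING.match(line)
        if match:
            # The heading level (number of "=") is ignored on purpose.
            current = match.group(2)
            sections.setdefault(current, [])
        elif line.strip():
            sections[current].append(line.strip())
    return sections


def marker_section(marker):
    """Return the section name a marker refers to, or None for plain <!-- bot -->."""
    return MARKER.fullmatch(marker).group(1)


list_page = (
    "http://example.org/\n\n"
    "== Part one ==\nhttps://something.example.net/\n\n"
    "== Part two ==\nhttps://foo.bar.example.com/\n"
)
sections = split_sections(list_page)
part_two = sections[marker_section("<!-- bot:Part two -->")]
```

With the example pages, the marker <!-- bot:Part two --> under "A section" picks up https://foo.bar.example.com/, while the plain <!-- bot --> marker picks up http://example.org/.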

Notes

The bot also supports notes. Adding a note to any entry within a list page section causes the bot to add an extra column to the table that contains the note. This is useful, for example, when listing social media profiles: the main URL would be the transfer.notkiska.pw URL (so that the bot can detect that it was saved, since this is what is fed into ArchiveBot), and the note field can contain the direct link to the relevant profile.

Usage (on the list page):

https://example.org/ | note = Something
https://transfer.notkiska.pw/fileid/twitter-@textfiles | note = https://twitter.com/textfiles
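A list page line with a label or note could be parsed roughly as in the Python sketch below. The function name parse_list_line is hypothetical, and the sketch assumes that fields are separated by | and that a label and a note may appear on the same line; the bot's real parser may behave differently.

```python
def parse_list_line(line):
    """Split 'URL | key = value | ...' into the URL and its label/note fields."""
    url, *fields = [part.strip() for part in line.split("|")]
    entry = {"url": url, "label": None, "note": None}
    for field in fields:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = field.partition("=")
        if sep and key.strip() in ("label", "note"):
            entry[key.strip()] = value.strip()
    return entry


plain = parse_list_line("https://example.org/ | note = Something")
profile = parse_list_line(
    "https://transfer.notkiska.pw/fileid/twitter-@textfiles"
    " | note = https://twitter.com/textfiles"
)
```

In the second case the URL used for matching ArchiveBot jobs stays the transfer.notkiska.pw one, and the profile link ends up in the note field.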

Caveats

  • Some URLs are never detected as saved by the bot even though they were saved. This is mostly due to bugs or missing features in the ArchiveBot viewer.
  • It can take a while until pages are (re-)processed by the bot. Usually, it should happen once per day.

Code

The bot's code is on GitHub.