From Archiveteam

Mirroring a phpBB forum with Wget can be a pain in the ass. Mirroring any forum usually is.

Recommended Wget switches

The following assumes that you are using an up-to-date Wget, 1.14 or later, for WARC support and regex rejects.

It also assumes that you want to crawl the forum as a random visitor, not logged in as a user.

To avoid getting URLs with a session ID in them, first grab any page and save the cookie you get (the URL below is a placeholder for the forum's address):

wget --spider --keep-session-cookies --save-cookies=COOKIEFILE http://www.example.com/phpBB3/
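
Before starting a long crawl it can be worth checking that the cookie file really holds a session cookie. The following is a minimal sketch, not part of the original recipe. It assumes that COOKIEFILE is in the tab-separated Netscape format that Wget writes, and that the board uses the usual phpBB3 cookie naming, where the session cookie name ends in _sid (the prefix is configurable per board, so treat that as an assumption). The function name has_session_cookie is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the cookie file contains a cookie whose
# name ends in "_sid". Assumes the Netscape cookie file format, where
# fields are tab-separated and field 6 is the cookie name.
has_session_cookie() {
    # A missing or unreadable file counts as "no session cookie".
    [ -r "$1" ] || return 1
    # Skip comment lines, look at the name field, exit 0 only on a match.
    awk -F '\t' '!/^#/ && $6 ~ /_sid$/ { found = 1 } END { exit !found }' "$1"
}

if has_session_cookie COOKIEFILE; then
    echo "session cookie found"
else
    echo "no session cookie found, session IDs may end up in URLs"
fi
```

If the check fails, try grabbing a different page of the forum, or look at the cookie file by hand to see what the board actually calls its session cookie.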

You now have a cookie, no more session ID URLs, yay! You are ready to mirror now:

wget -m -np -w 1 -a www.example.com_phpBB3_$(date +%Y%m%d).log -e robots=off -nv --adjust-extension --convert-links --page-requisites --reject-regex='(\?p=|&p=|mode=reply|view=|search\.php)' --warc-file=www.example.com_phpBB3_$(date +%Y%m%d) --warc-cdx --keep-session-cookies --load-cookies=COOKIEFILE http://www.example.com/phpBB3/

The reject-regex rules explained:

?p= and &p= only occur when linking to single posts. You are interested in the full threads so these would be redundant.

mode=reply would give you a page where you could enter a reply. You are not logged in, so chances are the forum would just deny that. Even if it did not, there would probably be no meaningful content to save.

view= prevents Wget from downloading view=previous and view=next, which are redundant: each thread has a link to the next and previous thread, and the resulting URLs carry the thread ID of the source thread (wtf...) plus the view= parameter. It also prevents downloading view=print, which would be redundant too. If you want to extract data later, you might want to include the print view; your choice.

search.php is obviously not wanted: search result pages only point back to threads you are already crawling.

If the forum blocks random visitors (like you with Wget) from accessing user profile pages, then add mode=viewprofile to the reject-regex!
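
To see what the rules above would do before running a crawl, you can try the regex against a few sample URLs. This is a rough sketch: it uses grep -E as a stand-in for Wget's default POSIX regex matching, which is an assumption and may differ in corner cases. The regex is the same alternation as in the mirror command, here with the dot in search.php escaped so it only matches a literal dot. The function name would_reject and the sample URLs are made up for illustration.

```shell
#!/bin/sh
# Reject regex as used with --reject-regex in the mirror command.
REJECT='(\?p=|&p=|mode=reply|view=|search\.php)'

# Hypothetical helper: succeed if the URL matches the reject regex.
# grep -E only approximates how Wget applies the pattern to a URL.
would_reject() {
    printf '%s\n' "$1" | grep -Eq -- "$REJECT"
}

# Made-up sample URLs covering each rule, plus two that should be kept.
for url in \
    'http://www.example.com/phpBB3/viewtopic.php?f=2&t=15' \
    'http://www.example.com/phpBB3/viewtopic.php?p=123' \
    'http://www.example.com/phpBB3/viewtopic.php?f=2&t=15&p=123' \
    'http://www.example.com/phpBB3/posting.php?mode=reply&f=2&t=15' \
    'http://www.example.com/phpBB3/viewtopic.php?f=2&t=15&view=next' \
    'http://www.example.com/phpBB3/search.php?search_id=active_topics' \
    'http://www.example.com/phpBB3/memberlist.php?mode=joined'
do
    if would_reject "$url"; then
        echo "reject $url"
    else
        echo "keep   $url"
    fi
done
```

Note that view= does not catch viewtopic.php or viewforum.php themselves, because the pattern needs the equals sign right after "view".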

You might want to reject these too:

  • mode=post
  • mode=email
  • mode=quote
  • mode=newtopic
  • login.php

Do not reject:

  • mode=joined (memberlist!)
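
With that many patterns the regex gets hard to read on one line. One way to keep it manageable is to build it from a list, sketched below. The function name build_reject_regex is made up for this example, and the list includes the optional patterns (mode=viewprofile and the "might want to reject" ones), so drop whatever does not apply to your forum. The final check uses grep -E as an approximation of Wget's matching to confirm that mode=joined is still let through.

```shell
#!/bin/sh
# Hypothetical helper: join all arguments with "|" and wrap them in
# parentheses, giving one alternation usable with --reject-regex.
build_reject_regex() {
    regex=''
    for pattern in "$@"; do
        # Add a "|" separator only when something is already in the regex.
        regex="${regex:+$regex|}$pattern"
    done
    printf '(%s)\n' "$regex"
}

# Base rules first, then the optional ones. Dots are escaped so that
# they match a literal dot only.
REJECT=$(build_reject_regex \
    '\?p=' '&p=' 'mode=reply' 'view=' 'search\.php' \
    'mode=viewprofile' \
    'mode=post' 'mode=email' 'mode=quote' 'mode=newtopic' 'login\.php')

echo "--reject-regex='$REJECT'"

# Sanity check: the memberlist sorted by join date must not be rejected.
if printf '%s\n' 'memberlist.php?mode=joined' | grep -Eq -- "$REJECT"; then
    echo "warning: mode=joined would be rejected"
fi
```

The printed option can be pasted into the mirror command in place of the shorter regex.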