Difference between revisions of "VBulletin"
Revision as of 22:46, 6 September 2012
Archiving vBulletin (tested only with http://boards.cityofheroes.com/, you may have to change some things):
1. Get a recent Wget+Lua version (it should include WARC support).
2. Get the vbulletin.lua script: https://raw.github.com/ArchiveTeam/cityofheroes-grab/master/vbulletin.lua
3. Collect the forum IDs (the f= parameter in the URLs) of forums and subforums. Some pages have a "Forum Jump" dropdown list that gives you the numbers.
4. Run Wget with the Lua script and seed it with the forum URLs. Start with the URL to /external.php?type=RSS2 to get a session cookie (a session cookie is needed so that the session ID is removed from the URLs).
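Collecting the forum IDs and building the seed list can be partly automated. The following is a minimal sketch, not part of the original instructions: `seed_urls` is a hypothetical helper, and it assumes you have saved a forum index page locally and that its links have the plain form `forumdisplay.php?f=N` (links with other parameters before `f=` are not matched, and the "Forum Jump" dropdown is not parsed).

```shell
#!/bin/sh
# Hypothetical helper, not part of vbulletin.lua or Wget.
# Usage: seed_urls BASE_URL SAVED_PAGE
#   BASE_URL   forum root without a trailing slash
#   SAVED_PAGE local HTML file containing forumdisplay.php?f=N links
seed_urls() {
  base=$1
  page=$2
  # The RSS URL goes first so that Wget receives a session cookie
  # before it requests any forum page.
  printf '%s\n' "$base/external.php?type=RSS2"
  # Extract every f=<number>, drop duplicates, sort numerically,
  # and print one forumdisplay URL per forum ID.
  grep -oE 'forumdisplay\.php\?f=[0-9]+' "$page" \
    | sed 's/.*f=//' \
    | sort -un \
    | while read -r id; do
        printf '%s\n' "$base/forumdisplay.php?f=$id"
      done
}
```

For example, `seed_urls "http://boards.cityofheroes.com" index.html > seeds.txt` (with `index.html` being whatever page you saved) writes one URL per line. Check the list by hand, since subforums that are not linked from the saved page will be missing.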
The Lua script will navigate the forum pages: it will follow pagination links, go from forumdisplay to threads, from threads to posts and members. Use --page-requisites and --span-hosts to get the images. When preparing the seed URLs, be aware that the Lua script only crawls from forum to thread to post/member. It does not, for example, jump from one forum to the other or from a thread back to the forum.
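The one-way crawl order matters when choosing seeds, so here is a toy model of it. This is only a sketch of the rule as described above, not the logic of vbulletin.lua: the function names are hypothetical, the script names (`forumdisplay.php`, `showthread.php`, `showpost.php`, `member.php`) are assumed from the usual vBulletin URL layout, and pagination within a forum or thread is ignored.

```shell
#!/bin/sh
# Toy model only; NOT taken from vbulletin.lua.
# page_rank URL: prints 1 for forum pages, 2 for threads,
# 3 for posts and member pages, 0 for anything else.
page_rank() {
  case $1 in
    */forumdisplay.php\?*) echo 1 ;;
    */showthread.php\?*) echo 2 ;;
    */showpost.php\?*|*/member.php\?*) echo 3 ;;
    *) echo 0 ;;
  esac
}

# may_follow FROM TO: succeeds only when TO is exactly one level
# below FROM (forum -> thread, thread -> post/member).
may_follow() {
  from=$(page_rank "$1")
  to=$(page_rank "$2")
  [ "$from" -gt 0 ] && [ "$to" -eq $((from + 1)) ]
}
```

Under this model a link from a thread back to its forum, or from one forum to another, is never followed, which is why every forum and subforum has to appear in the seed list.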
For example, this works for the City of Heroes forums:
./wget-lua \
  -U "$USER_AGENT" \
  -nv \
  -o wget.log \
  --directory-prefix files/ \
  --keep-session-cookies \
  --save-cookies cookies.txt \
  --force-directories \
  --adjust-extension \
  -e "robots=off" \
  --page-requisites --span-hosts \
  --lua-script vbulletin.lua \
  --timeout 10 \
  --tries 3 \
  --waitretry 5 \
  --warc-file forum \
  --warc-header "operator: Archive Team" \
  "http://boards.cityofheroes.com/external.php?type=RSS2" \
  "http://boards.cityofheroes.com/forumdisplay.php?f=547" \
  "http://boards.cityofheroes.com/forumdisplay.php?f=569" \
  "http://boards.cityofheroes.com/forumdisplay.php?f=660" \
  etc.