Talk:Angelfire
Some brainstorming from procrastination:
First grab all the sitemap indexes: curl http://www.angelfire.com/robots.txt | grep -Eo 'http.*gz' > sitemap-index-urls
http://www.angelfire.com/sitemap-index-00.xml.gz
http://www.angelfire.com/sitemap-index-01.xml.gz
http://www.angelfire.com/sitemap-index-02.xml.gz
...
Use that to grab all the sitemaps:
wget -i sitemap-index-urls
<sitemap><loc>http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/vevayaqo/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/planet/dumbass123/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
...
Extract the urls:
zgrep -hEo 'http:.*xml' sitemap-index-*.xml.gz > sitemap-urls
http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml
http://www.angelfire.com/vevayaqo/sitemap.xml
http://www.angelfire.com/planet/dumbass123/sitemap.xml
...
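A note on the pattern: 'http:.*xml' is greedy, so if an index file happens to hold several <sitemap> entries on one physical line, a single match would run from the first http: to the last xml. A slightly more defensive sketch is below. It assumes the URLs never contain a "<" character, and extract_sitemap_urls is a made-up helper name, not something from the original steps.

```shell
# Hypothetical helper: print one sitemap URL per line from gzipped index files.
# [^<]* cannot cross the closing </loc> tag, so several entries on the same
# physical line still come out as separate URLs.
extract_sitemap_urls() {
    zgrep -hEo 'http:[^<]*xml' "$@" | sort -u
}

# Usage (file names as in the steps above):
# extract_sitemap_urls sitemap-index-*.xml.gz > sitemap-urls
```

The sort -u only removes duplicate entries; drop it if the original order matters.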
And grab them all:
wget --force-directories -i sitemap-urls
TODO: Find a smart way to grab everything from that.
You will want --no-cookies, and you will want to reject anything matching this pattern: http://www.angelfire.lycos.com/doc/images/track/ot_noscript.gif.*
Some images are hosted on http://www.angelfire.lycos.com and will require some smart hackery.
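One possible starting point for the TODO, offered as an untested sketch rather than a worked-out plan: pull the page URLs out of the per-user sitemaps and feed them to wget. This assumes each sitemap.xml follows the standard sitemap format (page URLs inside <loc> elements) and that the files landed under www.angelfire.com/ because of --force-directories. The names list_page_urls and page-urls are invented for illustration, and --reject-regex needs wget 1.14 or newer.

```shell
# Hypothetical helper: read sitemap XML on stdin, print each <loc> URL once.
list_page_urls() {
    grep -Eo '<loc>[^<]+</loc>' \
        | sed -e 's#^<loc>##' -e 's#</loc>$##' \
        | sort -u
}

# Usage sketch (not run against the live site):
# find www.angelfire.com -name sitemap.xml -exec cat {} + | list_page_urls > page-urls
# wget --no-cookies --force-directories --page-requisites \
#      --reject-regex 'angelfire\.lycos\.com/doc/images/track/ot_noscript\.gif' \
#      -i page-urls
```

This does not solve the images hosted on www.angelfire.lycos.com; fetching requisites from another host would need extra options and is left open here.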
You can also extract the "realms" and username combinations from the sitemap-indexes:
zgrep -hEo 'http:.*xml' ori/sitemap-index-*.xml.gz | sed 's#http://www.angelfire.com/##' | sed 's#/sitemap.xml##' | sed 's#/#\t#'
Warning: There are usernames without a "realm" prefix, for example jeshare, seacrozzer or hjones669.
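With the sed chain above, a realm-less account such as vevayaqo comes out as a single field with no tab, so the username ends up in the realm column. A sketch that keeps the columns aligned by emitting an empty realm instead is below. It assumes the input is the list of sitemap URLs, one per line, and split_realm_user is a made-up name.

```shell
# Hypothetical helper: read sitemap URLs on stdin, print "realm<TAB>username".
# Accounts without a realm get an empty first column.
split_realm_user() {
    sed -e 's#^http://www\.angelfire\.com/##' -e 's#/sitemap\.xml$##' \
        | awk -F/ -v OFS='\t' '
            NF == 0 { next }                 # skip blank lines
            NF == 1 { print "", $1; next }   # username only, no realm
                    { print $1, $2 }         # realm and username
        '
}

# Usage:
# split_realm_user < sitemap-urls
```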
Guestbooks were killed in 2012, e.g. http://htmlgear.lycos.com/guest/control.guest?u=gosanson&i=2&a=view