Difference between revisions of "Angelfire"

From Archiveteam
(expand)
(Update status, link to collection)
(13 intermediate revisions by 8 users not shown)
Line 5: Line 5:
| URL = http://www.angelfire.lycos.com/
| URL = http://www.angelfire.lycos.com/
| project_status = {{online}}
| project_status = {{online}}
| archiving_status = {{nosavedyet}}
| archiving_status = {{inprogress}} (on hiatus since 2019)
| archiving_type = DPoS, ArchiveBot
| source = [https://github.com/ArchiveTeam/angelfire-grab angelfire-grab]
| tracker = [http://tracker.archiveteam.org/angelfire/ angelfire]
| irc = angelonfire
| data = {{IA collection|archiveteam_angelfire}}<br />{{Job|9yhap}}
}}
}}


Line 14: Line 19:
Angelfire underwent some changes in 2010, apparently not disruptive but requiring users to pay for some options like the old Web Shell tool; we do not know whether this caused some older websites to become unaccessible for their owners and whether that could cause inactivity and hence deletion. The Alexa rank of the property seems in constant fall, from better than 2000th position in early 2012 to worse than 3400th in early 2014.
Angelfire underwent some changes in 2010, apparently not disruptive but requiring users to pay for some options like the old Web Shell tool; we do not know whether this made some older websites inaccessible to their owners, or whether that could lead to inactivity and hence deletion. The Alexa rank of the property seems to be in steady decline, from better than 2000th position in early 2012 to worse than 3400th in early 2014.


It's not clear how bad [[Lycos]] is. A quick seach for Lycos shutdowns only points to their (independently operated) [[wikipedia:Lycos Europe|Lycos Europe]] liquidation, which gave less than a month for the users to save their emails before deletion. Lycos [[Tripod]] on the other hand, which was in 2003 [http://googlepress.blogspot.it/2003/06/google-and-lycos-europe-announce.html Europe's largest homepage building community (with special Google alliance)], [https://gigaom.com/2009/02/12/419-lycos-europe-finds-a-tripod-buyer-three-days-before-shutdown/ found a last minute buyer for its European wing] but then [https://web.archive.org/web/20130728005304/http://www.multimania.co.uk/ suddenly  went down in July 2013] (it was around 60,000th Alexa position in 2012 and fell well below 100,000 in early 2013).
It's not clear how bad [[Lycos]] is. A quick search for Lycos shutdowns only points to their (independently operated) [[wikipedia:Lycos Europe|Lycos Europe]] liquidation, which gave less than a month for the users to save their emails before deletion. Lycos [[Tripod]] on the other hand, which was in 2003 [http://googlepress.blogspot.it/2003/06/google-and-lycos-europe-announce.html Europe's largest homepage building community (with special Google alliance)], [https://gigaom.com/2009/02/12/419-lycos-europe-finds-a-tripod-buyer-three-days-before-shutdown/ found a last minute buyer for its European wing] but then [https://web.archive.org/web/20130728005304/http://www.multimania.co.uk/ suddenly  went down in July 2013] (it was around 60,000th Alexa position in 2012 and fell well below 100,000 in early 2013).
 
== Status ==
Warrior project is coming soon -- scripts are just about done so stay tuned!
 
All usernames/user info for scraping individual users' sitemaps can be found here: https://archive.org/details/angelfire-users-all_201808
 
Archivebot gave it a try, http://archive.fart.website/archivebot/viewer/job/9yhap
 
Schbirid has some ugly Bash scripts: https://github.com/SpiritQuaddicted/angelfire (ask before you use, they are probably out of date)
== Discovery & Downloading ==
First grab all the sitemap indexes:
 
curl http://www.angelfire.com/robots.txt | grep -Eo 'http.*gz' > sitemap-index-urls
<pre>
http://www.angelfire.com/sitemap-index-00.xml.gz
http://www.angelfire.com/sitemap-index-01.xml.gz
http://www.angelfire.com/sitemap-index-02.xml.gz
...
http://www.angelfire.com/sitemap-index-ff.xml.gz
</pre>
 
 
Use that to grab all the sitemaps:
 
wget -i sitemap-index-urls
 
Inside you will see the users' sitemap URLs:
<pre>
<sitemap><loc>http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/vevayaqo/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/planet/dumbass123/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
...
</pre>
 
 
Extract the user sitemap URLs:
 
zgrep -hEo 'http:.*xml' sitemap-index-*.xml.gz > sitemap-urls
<pre>
http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml
http://www.angelfire.com/vevayaqo/sitemap.xml
http://www.angelfire.com/planet/dumbass123/sitemap.xml
...
</pre>
 
Extract the webpage URLs:
 
grep -Eo '<loc>[^<]*</loc>' www.angelfire.com/"${user}"/sitemap.xml | sed -e 's#<loc>##' -e 's#</loc>##' > "${user}.urls"
<pre>
http://www.angelfire.com/ab7/pledgecry/band.html
http://www.angelfire.com/ab7/pledgecry/biography.html
http://www.angelfire.com/ab7/pledgecry/ernst.html
http://www.angelfire.com/ab7/pledgecry/header.html
...
</pre>
 
Grab them with options like: -m --no-parent --no-cookies -e robots=off --page-requisites --domains=angelfire.com,lycos.com
 
 
As of 2015-05-08 there were 3,895,290 users.
 
You will want --no-cookies because angelfire wants to set them everywhere.
 
Reject http://www.angelfire.lycos.com/doc/images/track/ot_noscript.gif.* and reject http://www.angelfire.com/adm/ad/ (ads) --> --reject-regex='(www.angelfire.com\/adm\/ad\/|www.angelfire.com\/doc\/images\/track\/ot_noscript\.gif)'
 
Some images are hosted on http://www.angelfire.lycos.com --> --domains=angelfire.com,lycos.com
 
 
Guestbooks were killed in 2012, e.g. http://htmlgear.lycos.com/guest/control.guest?u=gosanson&i=2&a=view
 
Some users have blogs with infinite calendars, like this one in the sitemap: http://filesha.angelfire.com/blog/index.blog . Wget will run forever on those, so it is better to skip them for now.
 
Many users have no URLs in their sitemaps. Not sure what to do with those.


== External links ==
== External links ==
Line 21: Line 99:


{{Navigation box}}
{{Navigation box}}
[[Category:Web hosting]]

Revision as of 04:52, 28 November 2021

Angelfire
URL http://www.angelfire.lycos.com/
Status Online!
Archiving status In progress... (on hiatus since 2019)
Archiving type DPoS, ArchiveBot
Project source angelfire-grab
Project tracker angelfire
IRC channel #angelonfire (on hackint)
Data archiveteam_angelfire
job:9yhap

Angelfire is a web hosting service that has operated since 1996 and contains big chunks of early WWW history (which people love to mock).

It is not expected that the Angelfire archive can ever be truly complete, as Angelfire, like other free hosts such as Homestead, has or had a policy of deleting "inactive" accounts. As there is no known mirror of many of these former accounts and associated web pages, there may be no way to recover such deleted websites.

Angelfire underwent some changes in 2010, apparently not disruptive but requiring users to pay for some options like the old Web Shell tool; we do not know whether this made some older websites inaccessible to their owners, or whether that could lead to inactivity and hence deletion. The Alexa rank of the property seems to be in steady decline, from better than 2000th position in early 2012 to worse than 3400th in early 2014.

It's not clear how bad Lycos is. A quick search for Lycos shutdowns only points to their (independently operated) Lycos Europe liquidation, which gave less than a month for the users to save their emails before deletion. Lycos Tripod on the other hand, which was in 2003 Europe's largest homepage building community (with special Google alliance), found a last minute buyer for its European wing but then suddenly went down in July 2013 (it was around 60,000th Alexa position in 2012 and fell well below 100,000 in early 2013).

Status

Warrior project is coming soon -- scripts are just about done so stay tuned!

All usernames/user info for scraping individual users' sitemaps can be found here: https://archive.org/details/angelfire-users-all_201808

Archivebot gave it a try, http://archive.fart.website/archivebot/viewer/job/9yhap

Schbirid has some ugly Bash scripts: https://github.com/SpiritQuaddicted/angelfire (ask before you use, they are probably out of date)

Discovery & Downloading

First grab all the sitemap indexes:

curl http://www.angelfire.com/robots.txt | grep -Eo 'http.*gz' > sitemap-index-urls

http://www.angelfire.com/sitemap-index-00.xml.gz
http://www.angelfire.com/sitemap-index-01.xml.gz
http://www.angelfire.com/sitemap-index-02.xml.gz
...
http://www.angelfire.com/sitemap-index-ff.xml.gz
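
The grep above keeps anything between "http" and "gz" on a line. A slightly stricter variant, as a minimal sketch: it only accepts the sitemap-index-XX.xml.gz naming shown in the listing and prints each URL once. The function name extract_sitemap_index_urls is made up for this example, and the pattern assumes the index files keep that hexadecimal naming.

```shell
#!/bin/sh
# Read a robots.txt on stdin and print each sitemap index URL once.
# Assumption: index files are always named sitemap-index-<hex>.xml.gz,
# as in the listing above.
extract_sitemap_index_urls() {
    grep -Eo 'https?://[^[:space:]]+/sitemap-index-[0-9a-f]+\.xml\.gz' | sort -u
}
```

It would be used as: curl -s http://www.angelfire.com/robots.txt | extract_sitemap_index_urls > sitemap-index-urls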


Use that to grab all the sitemaps:

wget -i sitemap-index-urls

Inside you will see the users' sitemap URLs:

<sitemap><loc>http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/vevayaqo/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/planet/dumbass123/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
...


Extract the user sitemap URLs:

zgrep -hEo 'http:.*xml' sitemap-index-*.xml.gz > sitemap-urls

http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml
http://www.angelfire.com/vevayaqo/sitemap.xml
http://www.angelfire.com/planet/dumbass123/sitemap.xml
...
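
The zgrep pattern above is greedy, so if an index file ever puts several sitemap entries on one line, it would return one long match running from the first "http:" to the last "xml". A minimal sketch that avoids this by matching each loc element separately; the function name extract_loc_urls_gz is hypothetical, and it assumes the index files are plain gzip as their extension suggests.

```shell
#!/bin/sh
# Print every <loc> value found in the given gzip-compressed sitemap index files.
# [^<]+ stops at the closing tag, so several entries on one line stay separate.
extract_loc_urls_gz() {
    gzip -dc -- "$@" \
        | grep -Eo '<loc>[^<]+</loc>' \
        | sed -e 's#^<loc>##' -e 's#</loc>$##'
}
```

It would be used as: extract_loc_urls_gz sitemap-index-*.xml.gz > sitemap-urls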

Extract the webpage URLs:

grep -Eo '<loc>[^<]*</loc>' www.angelfire.com/"${user}"/sitemap.xml | sed -e 's#<loc>##' -e 's#</loc>##' > "${user}.urls"

http://www.angelfire.com/ab7/pledgecry/band.html
http://www.angelfire.com/ab7/pledgecry/biography.html
http://www.angelfire.com/ab7/pledgecry/ernst.html
http://www.angelfire.com/ab7/pledgecry/header.html
...
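
In the command above, ${user} is the path between the host name and sitemap.xml. For many users that path contains a slash (ab7/pledgecry), so "${user}.urls" would point into a subdirectory that may not exist. A minimal sketch of deriving the path from a sitemap URL and turning it into a flat file name; both function names are hypothetical, and replacing "/" with "+" assumes that "+" never appears in Angelfire user paths.

```shell
#!/bin/sh
# Turn a user sitemap URL into the user path, e.g. "ab7/pledgecry".
user_path_from_sitemap_url() {
    printf '%s\n' "$1" \
        | sed -e 's#^[a-z]*://www\.angelfire\.com/##' -e 's#/sitemap\.xml$##'
}

# Build a flat file name for a user's URL list.
# Assumption: "+" does not occur in user paths, so it is safe as a separator.
urls_file_for_user() {
    printf '%s.urls\n' "$(printf '%s' "$1" | sed 's#/#+#g')"
}
```

For example, the sitemap URL http://www.angelfire.com/ab7/pledgecry/sitemap.xml gives the user path ab7/pledgecry and the list file ab7+pledgecry.urls.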

Grab them with options like: -m --no-parent --no-cookies -e robots=off --page-requisites --domains=angelfire.com,lycos.com


As of 2015-05-08 there were 3,895,290 users.

You will want --no-cookies because angelfire wants to set them everywhere.

Reject http://www.angelfire.lycos.com/doc/images/track/ot_noscript.gif.* and reject http://www.angelfire.com/adm/ad/ (ads) --> --reject-regex='(www.angelfire.com\/adm\/ad\/|www.angelfire.com\/doc\/images\/track\/ot_noscript\.gif)'

Some images are hosted on http://www.angelfire.lycos.com --> --domains=angelfire.com,lycos.com
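
The options above can be collected into one wrapper, sketched below. Note that the reject pattern quoted above names www.angelfire.com for the tracking image, while the prose gives www.angelfire.lycos.com; this sketch leaves the host open so both forms match, which is an assumption and not something verified against the site. The names REJECT_REGEX, is_rejected and grab_urls are made up for this example. is_rejected uses grep -E only to approximate how wget applies --reject-regex, so a pattern can be tried on sample URLs before a long crawl.

```shell
#!/bin/sh
# Reject pattern for ads and the tracking image.
# Assumption: either www.angelfire.com or www.angelfire.lycos.com may serve them.
REJECT_REGEX='angelfire\.(lycos\.)?com/(adm/ad/|doc/images/track/ot_noscript\.gif)'

# Succeeds if the URL matches the reject pattern (checked with grep -E,
# as an approximation of wget's own matching).
is_rejected() {
    printf '%s\n' "$1" | grep -Eq "$REJECT_REGEX"
}

# Mirror every URL listed in the file given as $1, using the options
# collected in the notes above.
grab_urls() {
    wget -m --no-parent --no-cookies -e robots=off --page-requisites \
        --domains=angelfire.com,lycos.com \
        --reject-regex="$REJECT_REGEX" \
        -i "$1"
}
```

It would be used as: grab_urls ab7+pledgecry.urls (the file name here is only an example).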


Guestbooks were killed in 2012, e.g. http://htmlgear.lycos.com/guest/control.guest?u=gosanson&i=2&a=view

Some users have blogs with infinite calendars, like this one in the sitemap: http://filesha.angelfire.com/blog/index.blog . Wget will run forever on those, so it is better to skip them for now.
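
One way to skip the blogs is to split them out of a URL list before handing it to wget, sketched here. The function name split_blog_urls is hypothetical, and the rule that a blog URL contains /blog/ or ends in .blog is a guess based on the single example above.

```shell
#!/bin/sh
# Read URLs on stdin. Blog URLs go to the file named in $1 (kept for later),
# all other URLs go to stdout.
# Assumption: blog URLs contain "/blog/" or end in ".blog".
split_blog_urls() {
    awk -v blogs="$1" '
        /\/blog\// || /\.blog([?#]|$)/ { print > blogs; next }
        { print }
    '
}
```

It would be used as: split_blog_urls skipped-blogs.urls < all.urls > safe.urls (the file names are examples). If a list contains no blog URLs, the file for skipped blogs is not created.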

Many users have no URLs in their sitemaps. Not sure what to do with those.
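
A possible first step for those users is simply to list them so they can be handled separately later. This is a minimal sketch: the function name list_empty_url_files is hypothetical, and it assumes the per-user .urls files from the extraction step all sit in one directory.

```shell
#!/bin/sh
# Print the path of every empty *.urls file in the directory given as $1,
# i.e. users whose sitemap contained no page URLs.
list_empty_url_files() {
    for f in "$1"/*.urls; do
        # -f skips the unexpanded glob when no files exist; -s is true for non-empty files.
        [ -f "$f" ] && [ ! -s "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

It would be used as: list_empty_url_files . > users-without-urls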

External links