GeoCities Project

From Archiveteam
Revision as of 10:44, 23 October 2009 by Jscott (talk | contribs)

The Geocities Project

Upon the news that Yahoo would close Geocities, Archive Team initiated the Geocities Project, a coordinated effort to rescue as much of Geocities' data as possible from the to-be-decommissioned Geocities servers. The project began in April 2009 and continued throughout the summer of 2009, up to Yahoo's closing date of October 26, 2009. A list of Frequently Asked Questions about this project was generated and is available here.

Parallel to our efforts (and in conjunction with them), archive.org began a major "deep crawl" of Geocities to add to its Wayback Machine. The page for their project is here.

Geocities Neighborhoods

Before the acquisition by Yahoo, Geocities used an unusual organization method for its userbase: Neighborhoods. Pages were grouped by subject matter into neighborhoods with names like Area51 (Science Fiction and Fantasy), Nashville (Country Music), Augusta (Golf) and others, which made it easier for a visitor to find the subject matter they were looking for. For context, search engines as the modern world knows them did not yet exist in such force.

A neighborhood would have up to 9,999 accounts underneath it, with each number representing the user's "block". Over time, Geocities added "Suburbs", which allowed an expansion past 9,999 users; these would have names like "Vault" and "Cavern" under the "Area51" neighborhood. A URL would then be available in the form of www.geocities.com/NEIGHBORHOOD/SUBURB/XXXX.
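The URL scheme above lends itself to simple enumeration. The following is a minimal sketch, not a description of how any Archive Team crawler actually worked: the function name candidate_urls, the http:// prefix, the zero-padding of the block number to four digits, and the default lower bound of 1 are all assumptions made for illustration.

```python
from typing import Iterator, Optional

BASE = "http://www.geocities.com"  # assumed scheme and host


def candidate_urls(
    neighborhood: str,
    suburb: Optional[str] = None,
    first: int = 1,
    last: int = 9999,
) -> Iterator[str]:
    """Yield candidate account URLs for one neighborhood (or one suburb).

    Assumption: block numbers are written as four digits (the XXXX in
    the URL form); whether every number was really in use is unknown.
    """
    if not 1 <= first <= last <= 9999:
        raise ValueError("block numbers must lie between 1 and 9999")
    prefix = f"{BASE}/{neighborhood}"
    if suburb:
        prefix = f"{prefix}/{suburb}"
    for block in range(first, last + 1):
        yield f"{prefix}/{block:04d}"


# Example: the first three candidate blocks of the "Vault" suburb of Area51.
for url in candidate_urls("Area51", "Vault", first=1000, last=1002):
    print(url)
```

A generator is used so that the full range of a neighborhood can be walked lazily, one URL at a time, without building the whole list in memory.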

Geocities Homestead Neighborhoods and Suburbs, although not updated since 2007, gives an excellent overview of the history of Geocities' Neighborhood organization.

The Size of Geocities Accounts

We're tracking how much a given account could store. So far, we know this (some news is contradictory; we're looking for press releases):

  • 1997: 2 MB limit for Geocities. [1]
  • 1998: 15 MB limit for small business service. [2]
  • 1999: Geocities has 12 terabytes of storage. [3]
  • 2001: 15 MB for Geocities, 25 MB for $8.95 a month. [4]
  • 2002: 15 MB limit for Geocities.
  • 2002: 25 MB for the newly introduced "Geocities Plus".
  • 2003: 25 MB for Geocities Plus (as of June).
  • 2005: 75 MB for Geocities Plus (as of January).
  • 2005: 25 MB for Geocities Plus (as of April).

Yahoo's Site Explorer showed 23 million HTML pages in Yahoo's index as of April 29th, 2009.

Tips n' Tricks

  • Although simple directory listings aren't accessible for users' accounts, you might be able to obtain an Apache-style directory listing for their subdirectories. For example, by stripping off the page filename from http://www.geocities.com/nenehs_world1/discography/homebrew.html, we can obtain an index for the subdirectory http://www.geocities.com/nenehs_world1/discography/. The benefit of this is that there may exist files which are not linked internally or externally, so crawlers are never made aware of them. Unfortunately, it seems many users do not organize their content into subdirectories, instead preferring to dump all files directly into the user directory. Also, they may have been good webmasters and provided a directory index page, which overrides the directory listing.
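The filename-stripping trick in the tip above can be sketched in a few lines. This is only an illustration using Python's standard urllib.parse module: the function name directory_candidates and the root_depth parameter are hypothetical, and the code only computes which URLs might be worth trying. It makes no request and cannot tell whether a listing is actually served.

```python
from urllib.parse import urlsplit, urlunsplit


def directory_candidates(page_url: str, root_depth: int = 1) -> list:
    """Return subdirectory URLs that might expose a directory listing.

    root_depth is the number of leading path segments that make up the
    user's top-level directory (assumed to be 1, e.g. /nenehs_world1/).
    That directory itself is skipped, since its listing is not accessible.
    """
    parts = urlsplit(page_url)
    segments = [s for s in parts.path.split("/") if s]
    # A path that does not end in "/" is assumed to end in a page filename.
    if not parts.path.endswith("/"):
        segments = segments[:-1]
    urls = []
    # Walk from the deepest subdirectory up towards the user directory.
    for depth in range(len(segments), root_depth, -1):
        path = "/" + "/".join(segments[:depth]) + "/"
        urls.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return urls


page = "http://www.geocities.com/nenehs_world1/discography/homebrew.html"
print(directory_candidates(page))
```

For a page that sits directly in the user directory, the function returns an empty list, which matches the observation that many accounts keep all files at the top level and so offer nothing to try.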

Lists

Users involved

  • User:Jscott, Joey paulprote and many others are downloading the main www.geocities.com stuff.
  • User:Soult downloaded parts of de.geocities.com, which are available as a tar archive here (the download takes 1-2 minutes to start before the first packets arrive; be patient).
  • User:Bbot is mirroring downloaded content.
  • User:Scumola is crawling Geocities using the archive.org crawler, but is on hold in June due to Comcast's 250GB bandwidth limit. Will resume in July.
  • Asheesh Laroia (User:Paulproteus) helped test User-Agent tricks to download from Geocities, and purchased geociti.es.