GeoCities Japan

URL http://www.geocities.jp/
http://www.geocities.co.jp/
Status Offline
Archiving status Partially saved
Archiving type Unknown
IRC channel #archiveteam-bs (on hackint)
(formerly #notagain (on EFnet))
Project lead User:Hiroi, User:DoomTay

GeoCities Japan was the Japanese version of GeoCities. It survived the 2009 shutdown of the global platform and shut down at the end of March 2019.

Shutdown

On 2018-10-01, Yahoo! Japan announced that they would be closing GeoCities at the end of March 2019. (New accounts could still be created until 2019-01-10.) It shut down on 2019-04-01 shortly after midnight JST.

Crawl Summaries

(Please add your crawls here)

Deduplication

We'll roughly follow the deduplication schema outlined here, but with a shared MySQL-compliant database. (The database will be online soon; in the meantime, you can begin preparing the metadata following the description below.)

The deduplication workflow goes as follows:

  1. During or after individual crawls, each person generates the metadata (using warcsum or other tools) corresponding to their crawled WARC files, following the schema below (a sketch is given after this list).
  2. Metadata is then inserted into the database. It is crucial that this table does not get screwed up, so please contact me (hiroi on the IRC channel) for access if you want to add your data.
    • If time/resources permit, the uploader may fill in the deduplication info at the time of insertion, but this is not required.
    • That's because (provided that all WARC files are available for download) the metadata in the database is enough for standalone deduplication.
  3. A dedicated worker machine will run through this table continuously and fill in the deduplication info (ref_id, ref_uri, ref_date); a standalone sketch follows the schema tables below.
    • As of now, such a script hasn't actually been written. If you're willing to write it, please let User:Hiroi know via IRC.
  4. At release time, we'll use this database to deduplicate all WARC archives at once (by replacing duplicate entries with revisit records) and combine them for release.
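
As a rough illustration of step 1, the sketch below pulls the same per-record metadata out of a gzipped WARC with the warcio library instead of warcsum; the field names mirror the uri_records table below. This is only an assumed alternative for sanity-checking your output, not the canonical tooling.

# Sketch only: extract uri_records-style metadata from a WARC with warcio
# (pip install warcio). warcsum is the tool referenced above; this merely
# illustrates the same fields.
from warcio.archiveiterator import ArchiveIterator

def warc_metadata(warc_path):
    """Yield one dict per response record, mirroring the uri_records columns."""
    with open(warc_path, 'rb') as stream:
        it = ArchiveIterator(stream)
        for record in it:
            if record.rec_type != 'response':
                continue
            headers = record.rec_headers
            yield {
                'uri': headers.get_header('WARC-Target-URI'),
                'datetime': headers.get_header('WARC-Date'),
                # crawler-written payload digest, already in "sha1:..." form
                'digest': headers.get_header('WARC-Payload-Digest'),
                # offset/length of the (compressed) record within the WARC file
                'warc_offset': it.get_record_offset(),
                'warc_len': it.get_record_length(),
            }

if __name__ == '__main__':
    for row in warc_metadata('example.warc.gz'):
        print(row)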

The database schema is given below. For details on warc_offset and warc_len, please see the source code of warcsum and other tools.

Table warc_records
+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| id            | int(11)      | NO   | PRI | NULL    | auto_increment |
| name          | varchar(1024)| NO   |     | NULL    |                | (WARC file name)
| size          | bigint(20)   | NO   |     | NULL    |                | (size of the file)
| location      | varchar(2083)| YES  |     | NULL    |                | (current available location, i.e. download link)
| digest        | varchar(1024)| YES  |     | NULL    |                | (hash of the entire file)
+---------------+--------------+------+-----+---------+----------------+

Table uri_records
+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| id            | int(11)      | NO   | PRI | NULL    | auto_increment |
| warc_id       | int(11)      | NO   |     | NULL    |                | (warc_records.id)
| warc_offset   | bigint(20)   | NO   |     | NULL    |                | (the offset of individual record in WARC file)
| warc_len      | bigint(20)   | NO   |     | NULL    |                | (length of the (compressed) individual record)
| uri           | varchar(2083)| NO   |     | NULL    |                | (uri of the record)
| datetime      | varchar(256) | NO   |     | NULL    |                | (access time, taken from WARC file directly)
| digest        | varchar(1024)| NO   |     | NULL    |                | (default value is "sha1:xxxxxx")
| ref_id        | int(11)      | YES  |     | NULL    |                | (original copy's id, if the record is a duplicate)
| ref_uri       | varchar(2083)| YES  |     | NULL    |                | (original copy's uri, can be filled in to reduce queries)
| ref_date      | varchar(256) | YES  |     | NULL    |                | (original copy's date)
+---------------+--------------+------+-----+---------+----------------+
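
For step 3, here is a minimal standalone sketch of the dedup pass, assuming the uri_records rows have already been fetched into memory; the row shape and the "earliest capture wins" rule are assumptions for illustration, not settled policy. It groups captures by payload digest and points every later capture's ref_id / ref_uri / ref_date at the earliest one.

# Standalone dedup sketch: fill ref_id / ref_uri / ref_date on every capture
# whose digest matches an earlier capture. Assumes rows are dicts shaped like
# the uri_records table above.
from collections import defaultdict

def fill_dedup_info(rows):
    by_digest = defaultdict(list)
    for row in rows:
        by_digest[row['digest']].append(row)

    for captures in by_digest.values():
        if len(captures) < 2:
            continue
        # WARC-Date is ISO 8601, so lexicographic order is chronological order
        captures.sort(key=lambda r: r['datetime'])
        original = captures[0]
        for dup in captures[1:]:
            dup['ref_id'] = original['id']
            dup['ref_uri'] = original['uri']
            dup['ref_date'] = original['datetime']
    return rows

Against the shared database the worker would do the same thing with UPDATE queries; the point is that the metadata alone is enough for standalone deduplication, as noted under step 2 above.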

Discovery Info

Crawler Traps

  • A common calendar CGI script, usually named “i-calendar.cgi”, seems to be able to trap Heritrix in timestamped infinite loops even with TooManyHopsDecideRule enabled. (Example)
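
If you maintain your own exclusion rules on top of the crawler's scope, a crude URL filter along these lines keeps the calendar CGI out of the frontier; the exact pattern is an assumption and should be tuned to the trap URLs you actually see.

# Crude reject filter for the i-calendar.cgi trap described above; the regex is
# a guess, adjust it to the URLs the trap actually generates.
import re

CALENDAR_TRAP = re.compile(r'/i-calendar\.cgi', re.IGNORECASE)

def should_fetch(url):
    """Return False for URLs that look like the calendar trap."""
    return not CALENDAR_TRAP.search(url)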

Issues

  • Hidden-entry sites (Importance: Low): There are a few sites that do not use index.htm/index.html as their entry points; as a result, first-level directory access will fail to reach them.
    • However, as long as other GeoCities sites link to them, they should be discoverable by the crawler.
    • So the only problem is pages whose inlinks are all dead. There should be very few of those. If we want to be absolutely sure, we can run a diff between IA's current CDX and that from the crawl (a sketch is given at the end of this page).
    • Note that this is not a problem for the neighborhood sites, as we can enumerate their URLs.
  • Deduplication (Importance: Low): If we are going to release a torrent as we did with GeoCities, then it may be worth deduplicating. Most likely it won't make a major difference.
  • Final Snapshot (Importance: Moderate): The page contents may still change between now and March 31, 2019, so we need to do another crawl close to that date.
    • Note that a lot of users will be setting up 301/302 redirects before the server shuts down. According to Yahoo!, we'll have until September 30, 2019 to record those 301/302s.
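
For the CDX diff mentioned under "Hidden-entry sites", something along these lines would do. It assumes plain space-separated CDX files with the original URL in the third column; the file names are placeholders.

# List URLs that IA's CDX knows about but our crawl's CDX does not.
def cdx_urls(path):
    urls = set()
    with open(path, encoding='utf-8', errors='replace') as f:
        for line in f:
            if line.startswith(' CDX') or not line.strip():
                continue  # skip the format header line and blank lines
            fields = line.split(' ')
            if len(fields) > 2:
                urls.add(fields[2])  # third field is the original URL
    return urls

missing = cdx_urls('ia_geocities_jp.cdx') - cdx_urls('our_crawl.cdx')
for url in sorted(missing):
    print(url)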