GeoCities Japan

URL http://www.geocities.jp/, http://www.geocities.co.jp/
Status Closing
Archiving status In progress...
Archiving type Unknown
IRC channel #notagain (on hackint)

GeoCities Japan is the Japanese version of GeoCities. It survived the 2009 shutdown of the global platform.

Shutdown

On 2018-10-01, Yahoo! Japan announced that they would be closing GeoCities at the end of March 2019. (New accounts can still be created until 2019-01-10.)

Crawl Summaries

(Please add your crawls here)

  • Nov 9 2018: Crawl completed using seeds compiled from IA’s existing CDX data (see Discovery Info below). Available on IA (upload in progress).

Deduplication

We'll roughly follow the deduplication schema outlined here, but with a shared MySQL-compatible database. (The database will be online soon; in the meantime, you can start preparing the metadata following the description below.)

The deduplication workflow goes as follows:

  1. During or after individual crawls, each person generates the metadata (using warcsum or other tools) for their crawled WARC files, following the schema below (a sketch of one way to do this follows this list).
  2. Metadata is then inserted into the database. It is crucial that this table does not get screwed up, so please contact me (hiroi on the IRC channel) for access if you want to add your data.
    • If time/resources permit, the uploader may fill in the deduplication info at the time of insertion, but this is not required.
    • That's because (provided that all WARC files are available for download) the metadata in the database is enough for standalone deduplication.
  3. A dedicated worker machine will continuously run through this table and fill in the deduplication info (ref_id, ref_uri, ref_date); a rough sketch of such a worker follows the schema below.
    • As of now this script hasn't actually been written yet. If you're willing to write it, please let User:Hiroi know via IRC.
  4. At the time of release, we'll use this database to deduplicate all WARC archives at once (replacing duplicate entries with revisit records) and combine everything into a single release.
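
As a concrete illustration of steps 1 and 2, here is a minimal sketch of generating the per-record metadata with the warcio Python library rather than warcsum (the CSV output is just one possible intermediate format; the actual database insert may look different):

import csv
import sys
from warcio.archiveiterator import ArchiveIterator

def dump_metadata(warc_path, out=sys.stdout):
    # One row per response record: WARC file name, offset, compressed length,
    # URI, datetime and payload digest, i.e. the columns of uri_records below.
    writer = csv.writer(out)
    with open(warc_path, 'rb') as stream:
        records = ArchiveIterator(stream)
        for record in records:
            if record.rec_type != 'response':
                continue
            headers = record.rec_headers
            uri = headers.get_header('WARC-Target-URI')
            date = headers.get_header('WARC-Date')
            digest = headers.get_header('WARC-Payload-Digest')  # "sha1:..."
            # Offset/length of the (gzipped) record; warcio fills these in
            # once the record has been read through.
            offset = records.get_record_offset()
            length = records.get_record_length()
            writer.writerow([warc_path, offset, length, uri, date, digest])

if __name__ == '__main__':
    dump_metadata(sys.argv[1])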

The database schema is as follows. For details on warc_offset and warc_len, please see the source code of warcsum and other tools.

Table warc_records
+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| id            | int(11)      | NO   | PRI | NULL    | auto_increment |
| name          | varchar(1024)| NO   |     | NULL    |                | (WARC file name)
| size          | bigint(20)   | NO   |     | NULL    |                | (size of the file)
| location      | varchar(2083)| YES  |     | NULL    |                | (current available location, i.e. download link)
| digest        | varchar(1024)| YES  |     | NULL    |                | (hash of the entire file)
+---------------+--------------+------+-----+---------+----------------+

Table uri_records
+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| id            | int(11)      | NO   | PRI | NULL    | auto_increment |
| warc_id       | int(11)      | NO   |     | NULL    |                | (warc_records.id)
| warc_offset   | bigint(20)   | NO   |     | NULL    |                | (the offset of individual record in WARC file)
| warc_len      | bigint(20)   | NO   |     | NULL    |                | (length of the (compressed) individual record)
| uri           | varchar(2083)| NO   |     | NULL    |                | (uri of the record)
| datetime      | varchar(256) | NO   |     | NULL    |                | (access time, taken from WARC file directly)
| digest        | varchar(1024)| NO   |     | NULL    |                | (default value is "sha1:xxxxxx")
| ref_id        | int(11)      | YES  |     | NULL    |                | (original copy's id, if the record is a duplicate)
| ref_uri       | varchar(2083)| YES  |     | NULL    |                | (original copy's uri, can be filled in to reduce queries)
| ref_date      | varchar(256) | YES  |     | NULL    |                | (original copy's date)
+---------------+--------------+------+-----+---------+----------------+
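
For step 3, the worker could be roughly the following sketch (assumptions: a MySQL-compatible server reachable through pymysql, column names exactly as in the schema above, and placeholder connection credentials; the real script would still need batching and error handling):

import pymysql

FIND_ORIGINAL = (
    "SELECT id, uri, datetime FROM uri_records "
    "WHERE digest = %s AND ref_id IS NULL AND id < %s "
    "ORDER BY datetime LIMIT 1"
)

def fill_refs(conn):
    # For every record without deduplication info, look for an earlier record
    # with the same payload digest and point ref_id/ref_uri/ref_date at it.
    with conn.cursor() as cur:
        cur.execute("SELECT id, digest FROM uri_records WHERE ref_id IS NULL")
        for rec_id, digest in cur.fetchall():
            cur.execute(FIND_ORIGINAL, (digest, rec_id))
            original = cur.fetchone()
            if original:
                cur.execute(
                    "UPDATE uri_records "
                    "SET ref_id = %s, ref_uri = %s, ref_date = %s WHERE id = %s",
                    (original[0], original[1], original[2], rec_id),
                )
    conn.commit()

if __name__ == '__main__':
    # Placeholder credentials; the shared database is not online yet.
    fill_refs(pymysql.connect(host='localhost', user='archiveteam',
                              password='...', database='geocities_jp'))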

Discovery Info

  • DNS CNAMEs for geocities (JSON format): [1] (dead link), [2]
  • Records compiled from IA’s CDX data, available here (alternative link: [3])
    • geocities_jp_first.txt: First level subdirectory list under geocities.jp, compiled from IA CDX data. 566,690 records in total.
    • geocities_co_jp_first.txt: Same as above, for geocities.co.jp. 12,470 records in total.
      • NOTE: The majority of sites under geocities.co.jp are not first-level sites but second-level "neighborhood" sites (there could be, in theory, 1.79M of them; how many actually exist is unknown); see the explanation below.
    • blogs_yahoo_co_jp_first.txt: Same as above, for blogs.yahoo.co.jp. 646,901 records in total.
    • geocities_co_jp_fields.txt: List of neighborhood names under geocities.co.jp.
      • Individual websites are listed in the following format: http://www.geocities.co.jp/[NeighborhoodName]/[AAAA] where AAAA ranges from 1000 to 9999 (see the enumeration sketch after this list).
    • include-surts.txt: List of subdomains that should be allowed by your crawler.
  • geocities.jp grab from E-Shuushuu Wiki, crawled as job:cu6azkjwy45qmo1wwdxsdfusj: Pastebin
  • geocities.jp grab from Danbooru, crawled as job:5x0pf7wloqgeqc2r9rddino2l: Gist
  • geocities.co.jp and missed geocities.jp URLs grabbed from the above targets, crawled as job:31ges4c4c96k140sp6zah5vcc: [4] (dead link), [5]
  • geocities.co.jp and geocities.jp crawl from Miss Surfersparadise, crawled as job:e8ynrp5a7p4vwjkyxw9eph9p0: [6] (dead link), [7]
  • Links from this Business Insider article (and links within those pages), crawled as job:ayildv5yxmeo6s7egxni9dlnd: [8]
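
Since the neighborhood URLs follow a fixed pattern, the candidate list for geocities.co.jp can simply be enumerated from geocities_co_jp_fields.txt. A sketch (assuming one neighborhood name per line; the output file name is made up, and most of the generated URLs will of course 404):

def neighborhood_urls(fields_path='geocities_co_jp_fields.txt'):
    # Expand every neighborhood name into its possible second-level sites:
    # http://www.geocities.co.jp/[NeighborhoodName]/[AAAA], AAAA in 1000..9999.
    with open(fields_path, encoding='utf-8') as f:
        names = [line.strip() for line in f if line.strip()]
    for name in names:
        for number in range(1000, 10000):
            yield 'http://www.geocities.co.jp/%s/%d' % (name, number)

if __name__ == '__main__':
    with open('co_jp_seeds.txt', 'w') as out:  # hypothetical output file
        for url in neighborhood_urls():
            out.write(url + '\n')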

Crawler Traps

  • A common calendar CGI script, usually named “i-calendar.cgi”, seems to be able to trap Heritrix in timestamped infinite loops even with TooManyHopsDecideRule enabled. (Example) A possible exclusion pattern is sketched below.
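
One way to keep such a trap out of a crawl is a URL reject pattern. A rough sketch of the idea in Python (the regex is a guess at the shape of the trap URLs, and the example URLs are made up; in Heritrix the equivalent would be a REJECT decide rule with a similar expression):

import re

# Hypothetical pattern: reject any i-calendar.cgi request that carries a
# query string, which is what makes the calendar loop unbounded.
CALENDAR_TRAP = re.compile(r'/i-calendar\.cgi\?', re.IGNORECASE)

def should_fetch(url):
    return CALENDAR_TRAP.search(url) is None

assert should_fetch('http://www.geocities.jp/example/index.html')
assert not should_fetch('http://www.geocities.jp/example/i-calendar.cgi?ym=201811')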

Issues

  • Hidden-entry sites (Importance: Low): There are a few sites that do not use index.htm/index.html as their entry point; as a result, first-level directory access will fail to reach them.
    • However, as long as other GeoCities sites link to them, they should be discoverable by the crawler.
    • So the only problem is pages whose inlinks are all dead. There should be very few of those. If we want to be absolutely sure, we can run a diff between IA's current CDX and that from the crawl (a sketch follows at the end of this section).
    • Notice that this is not a problem for the neighborhood sites, since we can enumerate their URLs.
  • Deduplication (Importance: Low): If we are going to release a torrent as we did with GeoCities, it may be worth deduplicating. Most likely it won't make a major difference.
  • Final Snapshot (Importance: Moderate): The page contents may still change between now and March 31 2019, so we need to do another crawl when the time is near.
    • Note that a lot of users will be setting up 301/302 redirects before the server shuts down. According to Yahoo, we'll have until Sep 30 2019 to record those 301/302s.
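
For the CDX diff mentioned under Hidden-entry sites, something along these lines would do (a sketch; it assumes plain-text CDX files in the classic "CDX N b a m s k r M S V g" layout, where the original URL is the third field):

def cdx_urls(path, column=2):
    # Collect the original-URL column from a plain-text CDX file.
    urls = set()
    with open(path, encoding='utf-8', errors='replace') as f:
        for line in f:
            if line.lstrip().startswith('CDX ') or not line.strip():
                continue  # skip the header and blank lines
            fields = line.split(' ')
            if len(fields) > column:
                urls.add(fields[column])
    return urls

def missing_from_crawl(ia_cdx_path, crawl_cdx_path):
    # URLs that IA has already seen but our crawl did not reach.
    return cdx_urls(ia_cdx_path) - cdx_urls(crawl_cdx_path)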