GeoCities Japan
URL | http://www.geocities.jp/, http://www.geocities.co.jp/
Status | Offline
Archiving status | Partially saved
Archiving type | Unknown
IRC channel | #archiveteam-bs (on hackint) (formerly #notagain (on EFnet))
Project lead | User:Hiroi, User:DoomTay
GeoCities Japan was the Japanese version of GeoCities. It survived the 2009 shutdown of the global platform and shut down at the end of March 2019.
Shutdown
On 2018-10-01, Yahoo! Japan announced that they would be closing GeoCities at the end of March 2019. (New accounts could still be created until 2019-01-10.) It shut down on 2019-04-01 shortly after midnight JST.
Crawl Summaries
(Please add your crawls here)
- Nov 9 2018: crawl done using seeds compiled from IA’s existing CDX data (see below). Available on IA (currently being uploaded).
- Total size: 3.7TB (uncompressed: 3.9TB)
- Total URLs crawled: 96M
- Crawl report, Hostname list, MIME type report
Deduplication
We'll roughly follow the deduplication scheme outlined here, but with a shared MySQL-compatible database. (The database will be online soon; in the meantime, you can begin to prepare the metadata following the description below.)
The deduplication workflow goes as follows:
- During or after individual crawls, each person generates the metadata (using warcsum or other tools) corresponding to their crawled WARC files, following the schema below (a sketch of this step is given after the schema).
- Metadata is then inserted into the database. It is crucial that this table is not corrupted, so please contact me (hiroi on the IRC channel) for access if you want to add your data.
- If time and resources permit, the uploader may fill in the deduplication info at the time of insertion, but this is not required.
- This is because (provided that all WARC files are available for download) the metadata in the database is enough for standalone deduplication.
- A dedicated worker machine will run through this table continuously, filling in the deduplication info (ref_id, ref_uri, ref_date).
- As of now, this script has not been written yet; if you're willing to write it, please let User:Hiroi know via IRC (a rough sketch follows this list).
- At the time of release, we'll use this database to deduplicate all WARC archives at once (by replacing duplicated entries with revisit records) and combine everything into the final release.
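A minimal sketch of what such a worker pass could look like, assuming the uri_records table described below has been populated. Python's sqlite3 module stands in here for the shared MySQL database, and deduplicating on the payload digest alone is an assumption of the sketch rather than settled policy:

import sqlite3

# Rough sketch of the deduplication worker: for every payload digest that
# occurs more than once, treat the earliest capture as the original copy
# and point all later captures at it via ref_id / ref_uri / ref_date.
def dedup_pass(db_path="dedup.db"):  # db_path is a placeholder
    conn = sqlite3.connect(db_path)
    cur = conn.execute(
        "SELECT digest FROM uri_records GROUP BY digest HAVING COUNT(*) > 1")
    for (digest,) in cur.fetchall():
        orig_id, orig_uri, orig_date = conn.execute(
            "SELECT id, uri, datetime FROM uri_records "
            "WHERE digest = ? ORDER BY datetime, id LIMIT 1",
            (digest,)).fetchone()
        conn.execute(
            "UPDATE uri_records SET ref_id = ?, ref_uri = ?, ref_date = ? "
            "WHERE digest = ? AND id != ?",
            (orig_id, orig_uri, orig_date, digest, orig_id))
    conn.commit()
    conn.close()

The final release step would then read ref_uri and ref_date back out of the table when rewriting duplicate entries as revisit records.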
The database schema is as follows. For details on warc_offset and warc_len, please see the source code of warcsum and other tools.
Table warc_records

+----------+---------------+------+-----+---------+----------------+
| Field    | Type          | Null | Key | Default | Extra          |
+----------+---------------+------+-----+---------+----------------+
| id       | int(11)       | NO   | PRI | NULL    | auto_increment |
| name     | varchar(1024) | NO   |     | NULL    |                | (WARC file name)
| size     | bigint(20)    | NO   |     | NULL    |                | (size of the file)
| location | varchar(2083) | YES  |     | NULL    |                | (current available location, i.e. download link)
| digest   | varchar(1024) | YES  |     | NULL    |                | (hash of the entire file)
+----------+---------------+------+-----+---------+----------------+

Table uri_records

+-------------+---------------+------+-----+---------+----------------+
| Field       | Type          | Null | Key | Default | Extra          |
+-------------+---------------+------+-----+---------+----------------+
| id          | int(11)       | NO   | PRI | NULL    | auto_increment |
| warc_id     | int(11)       | NO   |     | NULL    |                | (warc_records.id)
| warc_offset | bigint(20)    | NO   |     | NULL    |                | (offset of the individual record in the WARC file)
| warc_len    | bigint(20)    | NO   |     | NULL    |                | (length of the (compressed) individual record)
| uri         | varchar(2083) | NO   |     | NULL    |                | (URI of the record)
| datetime    | varchar(256)  | NO   |     | NULL    |                | (access time, taken from the WARC file directly)
| digest      | varchar(1024) | NO   |     | NULL    |                | (default value is "sha1:xxxxxx")
| ref_id      | int(11)       | YES  |     | NULL    |                | (original copy's id, if the record is a duplicate)
| ref_uri     | varchar(2083) | YES  |     | NULL    |                | (original copy's URI, can be filled in to reduce queries)
| ref_date    | varchar(256)  | YES  |     | NULL    |                | (original copy's date)
+-------------+---------------+------+-----+---------+----------------+
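For reference, the per-record metadata can also be generated without warcsum. The sketch below uses the warcio Python library; the tab-separated output and the restriction to response records are choices of this example, not requirements:

import sys
from warcio.archiveiterator import ArchiveIterator

# Walk a (possibly gzipped) WARC file and print one line per response
# record with the fields that feed uri_records: WARC file name, record
# offset, compressed record length, URI, capture date and payload digest.
def dump_metadata(warc_path):
    with open(warc_path, "rb") as stream:
        it = ArchiveIterator(stream)
        for record in it:
            if record.rec_type != "response":
                continue
            headers = record.rec_headers
            uri = headers.get_header("WARC-Target-URI") or ""
            date = headers.get_header("WARC-Date") or ""
            digest = headers.get_header("WARC-Payload-Digest") or ""
            offset = it.get_record_offset()   # -> warc_offset
            length = it.get_record_length()   # -> warc_len
            print("\t".join([warc_path, str(offset), str(length),
                             uri, date, digest]))

if __name__ == "__main__":
    dump_metadata(sys.argv[1])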
Discovery Info
- DNS CNAMEs for geocities (JSON format): [1] (dead link), [2]
- Records compiled from IA's CDX data, available here (alternative link: [3])
- geocities_jp_first.txt: First level subdirectory list under geocities.jp, compiled from IA CDX data. 566,690 records in total.
- geocities_co_jp_first.txt: Same as above, for geocities.co.jp. 12,470 records in total.
- NOTE: The majority of sites under geocities.co.jp are not first-level sites but second-level "neighborhood" sites (in theory there could be 1.79M of them; how many actually exist is unknown); see the explanation below.
- blogs_yahoo_co_jp_first.txt: Same as above, for blogs.yahoo.co.jp. 646,901 records in total.
- geocities_co_jp_fields.txt: List of neighborhood names under geocities.co.jp.
- Individual websites are listed in the following format, where AAAA ranges from 1000 to 9999 (an enumeration sketch is given at the end of this list):
http://www.geocities.co.jp/[NeighborhoodName]/[AAAA]
- include-surts.txt: List of subdomains that should be allowed by your crawler.
- geocities.jp grab from E-Shuushuu Wiki, crawled as job:cu6azkjwy45qmo1wwdxsdfusj: Pastebin
- geocities.jp grab from Danbooru, crawled as job:5x0pf7wloqgeqc2r9rddino2l: Gist
- geocities.co.jp and missed geocities.jp URLs grabbed from the above targets, crawled as job:31ges4c4c96k140sp6zah5vcc: [4] (dead link), [5]
- geocities.co.jp and geocities.jp crawl from Miss Surfersparadise, crawled as job:e8ynrp5a7p4vwjkyxw9eph9p0: [6] (dead link), [7]
- Crawls from links within links from this Business Insider article, crawled as job:ayildv5yxmeo6s7egxni9dlnd: https://transfer.sh/uPLU4/biscrapes.txt
- Sites collated by User:Sanqui job:cp5r3a9fifipnbxo8hsy4tmhx https://etc.sanqui.net/archiveteam/geocities.jp_various.txt
- Scrapes from Ragsearch job:adh7m0i9ka25buvdlabm0p9ii [8] job:dmde087vgmmjluo9qjodob1ai [9] job:54l4xfl49rqpfttrkbzv968zm [10]
- Scrapes from PuniTo job:2752dep7k79puge1a9mdo93x1 [11]
- job:eoy17cb66jg4f9vmgi0v9fexo [12]
- Scrapes from Amaterasu (NSFW) job:2vbwnt5l8nipjddqo17ex2r3j [13]
- Scrapes from Surfers Paradise job:chr2z6wrw4srlmxo489wksqef [14]
- Scrapes from Meguri-net and Oisearch job:5m5qct4quwkn3blzgitqtd3uq https://transfer.sh/2qlfJ/meguri+oisan.txt [15]
- Scrapes from Game-Michi job:5p4pvzxl74gxrj8dtky87kpfo [16]
- Scrapes from Bishoujo NAVI job:1923nftkucm16x888vyvcvuvb [17]
- Scrapes from Love Hina Search job:bcxxlfuso9uveek93abd6ua2y [18]
- Scrapes from MultiLink job:be6w30ni9v31t0rg5edq694k0 https://transfer.sh/z2lhW/multiez.txt [19]
- Scrapes from Gameha job:5ezyb53ch6ip4uklwgal4nsak [20]
- Scrapes from an earlier domain for Ragsearch job:cfv3zp5uj886dsp01gj4m1mt4 [21]
- job:aa63sfmum7cb3m58vvumtuosl filtered from https://geo.98nx.jp/list.txt, from user nakomikan on IRC
- job:f4bz9nodrgq4m620auucpjpoe list of SDF doujinshi manga circles filtered from https://pastebin.com/6egiap0k, from nakomikan on IRC
- job:5phcgljf5fxowviasvwpb0flh filtered from https://pastebin.com/f4y0Mrah, from user nakomikan on IRC
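For the neighborhood sites noted above, the candidate URLs can simply be enumerated. A minimal sketch, assuming geocities_co_jp_fields.txt contains one neighborhood name per line; the output is meant as crawler seed input, and most of the generated addresses will not actually exist:

# Enumerate every theoretical neighborhood address of the form
#   http://www.geocities.co.jp/[NeighborhoodName]/[AAAA]
# with AAAA ranging from 1000 to 9999.
def enumerate_neighborhood_urls(fields_file="geocities_co_jp_fields.txt"):
    with open(fields_file, encoding="utf-8") as f:
        neighborhoods = [line.strip() for line in f if line.strip()]
    for name in neighborhoods:
        for number in range(1000, 10000):
            yield "http://www.geocities.co.jp/%s/%d" % (name, number)

if __name__ == "__main__":
    for url in enumerate_neighborhood_urls():
        print(url)

At 9,000 candidate numbers per neighborhood, this multiplies out to the roughly 1.79M theoretical sites mentioned above.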
Crawler Traps
- A common calendar CGI script, usually named “i-calendar.cgi”, appears to trap Heritrix in timestamped infinite loops even with TooManyHopsDecideRule enabled. (Example)
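One crude workaround outside of Heritrix's own decide rules is to filter such URLs out of seed and outlink lists before queuing them. A sketch, where the regex is an assumption based only on the script name:

import re
import sys

# Drop calendar-CGI URLs (anything hitting i-calendar.cgi) from a list of
# URLs read from stdin, so the crawler never enters the timestamped loop.
TRAP_RE = re.compile(r"/i-calendar\.cgi", re.IGNORECASE)

for line in sys.stdin:
    url = line.strip()
    if url and not TRAP_RE.search(url):
        print(url)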
Issues
- Hidden-entry sites (Importance: Low): There are a few sites that do not use index.htm/index.html as their entry points; as a result, first level directory access will fail to reach them.
- However, as long as other GeoCities sites link to them, they should be discoverable by the crawler.
- So the only problematic pages are those whose inlinks are all dead; there should be very few of them. If we want to be absolutely sure, we can run a diff between IA's current CDX and the CDX from our crawl (a sketch of such a diff is given at the end of this section).
- Note that this is not a problem for the neighborhood sites, as we can enumerate their URLs.
- Deduplication (Importance: Low): If we are going to release a torrent as we did with GeoCities, it may be worth deduplicating; most likely it won't make a major difference.
- Final Snapshot (Importance: Moderate): Page contents may still change between now and March 31, 2019, so we need to do another crawl close to that date.
- Note that many users will set up 301/302 redirects before the server shuts down. According to Yahoo, we'll have until Sep 30, 2019 to record those 301/302s.
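A rough sketch of the CDX diff mentioned under “Hidden-entry sites”: it prints URL keys that are present in IA's CDX but missing from the crawl's CDX. It assumes plain-text CDX files with the (SURT) URL key as the first whitespace-separated field; actual field layouts vary between CDX flavours:

import sys

# Collect the URL keys (first whitespace-separated field) from a CDX file.
def url_keys(path):
    keys = set()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if parts:
                keys.add(parts[0])
    return keys

if __name__ == "__main__":
    ia_cdx, crawl_cdx = sys.argv[1], sys.argv[2]
    for key in sorted(url_keys(ia_cdx) - url_keys(crawl_cdx)):
        print(key)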