Nifty

Japanese ISP with web hosting
URL	homepage.nifty.com
Status	Offline
Archiving status	Saved!
Archiving type	Unknown
Project source	https://github.com/ArchiveTeam/nifty-discovery
IRC channel	#archiveteam-bs (on hackint) (formerly #niftyjanai (on EFnet))
Project lead	User:Sanqui, User:DoomTay

Nifty is a Japanese ISP that provided web hosting. It announced that about 140,000 unclaimed homepages would be closed on 2016-11-10 at 15:00. Termination notice (Japanese)

http://homepage1.nifty.com/USERNAME/
http://homepage2.nifty.com/USERNAME/
http://homepage3.nifty.com/USERNAME/
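
Because all three patterns share one shape, a harvested deep link can be reduced to the root homepage it belongs to. The following is a minimal sketch, not the project's actual script: `root_homepage` is a hypothetical helper, and it assumes the host is always `homepage` followed by digits and that the first path segment is the username.

```python
import re
from urllib.parse import urlsplit

# Assumption: hosts look like homepageN.nifty.com (N = 1, 2, 3, ...).
HOST_RE = re.compile(r"^homepage\d+\.nifty\.com$")

def root_homepage(url):
    """Return http://homepageN.nifty.com/USERNAME/ for a deep link, or None."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    if not HOST_RE.match(host):
        return None
    # Assumption: the first non-empty path segment is the username.
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return None
    return f"http://{host}/{segments[0]}/"
```

A link such as `http://homepage2.nifty.com/USERNAME/dir/page.html` would map to `http://homepage2.nifty.com/USERNAME/`. URLs on other hosts, or with no path segment, yield `None`.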

URL harvesting

Follow the techniques described on Site exploration.

<polm> One thing I would recommend is searching Hatena Bookmarks, which is like a Japanese free Pinboard
<polm> Like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com
<polm> the "of" query parameter paginates like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com&of=20
<zout> there's some here. https://archive.is/homepage2.nifty.com
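
The pagination polm describes can be sketched as a generator of entrylist URLs. This is only an illustration: `hatena_pages` is a hypothetical name, and the step of 20 entries per page is an assumption inferred from the `of=20` example above, not a documented value.

```python
from urllib.parse import urlencode

BASE = "http://b.hatena.ne.jp/entrylist"

def hatena_pages(site, pages, step=20):
    """Yield entrylist URLs for `site`, one per page.

    `step` is the assumed increment of the "of" offset parameter.
    """
    for i in range(pages):
        params = {"url": site}
        if i:  # the first page carries no offset
            params["of"] = i * step
        yield f"{BASE}?{urlencode(params)}"
```

Calling `list(hatena_pages("homepage2.nifty.com", 2))` reproduces the two URLs quoted in the chat. Fetching and parsing the pages is left out of this sketch.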

Progress

On 2016-09-12, User:Sanqui harvested 8884 *.nifty.com URLs from Wikimedia sites using mwlinkscrape
On 2016-09-13, root homepages were added to this list, making it 11423 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/wikimedia.txt. job:21z8da69732jgmp4g6pn949p4
On 2016-09-15, Hatena bookmarks were scraped with a script and derived, producing a list of 19973 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/hatena.txt. job:3i04vcsil92hl80yxbxiimncn
On 2016-09-16, archive.is pages were scraped with a script, derived and deduplicated, producing a list of a mere 1165 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/archiveis.txt. job:2bkvkya714zxqkity2cmw1w10
User:DoomTay has plucked more URLs from the e-shuushuu wiki (job:3spkhvzhep0azp811nk4zelw5) and from Miss Surfersparadise (job:ew3a0olovf2e2pq20ki2fwgra)
On 2016-09-23, almost 80 URLs were scraped from Portalgraphics.net artist data (job:6gjq81kbvhhcjvf6v5z4ysv4i)
From 2016-09-23 to 2016-11-08 thousands more URLs were scraped from a mixture of sources (job:83nkqxzrbuuojnol1yzz4katq, job:de2s3en6ayvo8vtyy91vmc3re, job:dvmhmomc7foe3t3mfbnqptgac, job:1kpy7mk8a5glwq8ne7plb7a83, job:3djy7ku5qhsdh9whcpnk6zkt, job:ad9xia0mpn616k0bjjxss3zcd, job:3xb4h934hh57p1u2pl2dd2qcu)
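
Since the lists above come from overlapping sources, they need deduplicating before being queued. A minimal sketch of one way to do that follows. `merge_url_lists` is a hypothetical helper, and exact string comparison is an assumption: the scripts actually used may have normalized URLs differently.

```python
def merge_url_lists(*lists):
    """Merge iterables of URLs, dropping blanks and exact duplicates.

    First-seen order is preserved so earlier sources keep priority.
    """
    seen = set()
    merged = []
    for urls in lists:
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```

Each argument can be any iterable of strings, for example an open file such as a local copy of wikimedia.txt or hatena.txt.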

Next steps

GoogleScraper is no good. Try scraping Bing and Twitter using the hints on Site exploration
Put chunks of up to 100k URLs onto high-speed (20160911.01) ArchiveBot pipelines
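
Splitting a large list into chunks of at most 100k URLs can be sketched as below. `chunks` is a hypothetical helper shown only to illustrate the batching; how each chunk is submitted to ArchiveBot is not covered here.

```python
from itertools import islice

def chunks(urls, size=100_000):
    """Yield successive lists of at most `size` URLs from any iterable."""
    it = iter(urls)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Each yielded batch can then be written to its own file and queued as a separate job.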
