Revision as of 16:11, 15 September 2016
Nifty
Japanese ISP with web hosting
URL: homepage.nifty.com
Status: Closing
Archiving status: Not saved yet
Archiving type: Unknown
IRC channel: #archiveteam-bs (on hackint)
Nifty is a Japanese ISP providing web hosting. It will be closing about 140,000 unclaimed homepages by 2016-09-29. Termination notice (Japanese)
http://homepage1.nifty.com/USERNAME/
http://homepage2.nifty.com/USERNAME/
http://homepage3.nifty.com/USERNAME/
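For illustration, a minimal sketch of normalizing harvested links to these root homepage URLs. The hostname pattern follows the three hosts listed above; the helper name `root_homepage` is hypothetical, not part of any existing tooling:

```python
import re

# Matches the three known hosting hostnames (homepage1/2/3.nifty.com)
# and captures the username component of the path.
HOMEPAGE_RE = re.compile(
    r"^https?://homepage[123]\.nifty\.com/([^/?#]+)", re.IGNORECASE
)

def root_homepage(url):
    """Normalize any page under a user's site to its root homepage URL,
    or return None if the URL is not a nifty homepage."""
    m = HOMEPAGE_RE.match(url)
    if not m:
        return None
    host = url.split("/")[2].lower()
    return "http://%s/%s/" % (host, m.group(1))
```

Running harvested URLs through a normalizer like this yields the deduplicated root-homepage list that a crawl can start from.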
URL harvesting
Let's follow Site exploration.
<polm> One thing I would recommend is searching Hatena Bookmarks, which is like a Japanese free Pinboard
<polm> Like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com
<polm> the "of" query parameter paginates like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com&of=20
<zout> there's some here: https://archive.is/homepage2.nifty.com
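The `of` pagination described above can be sketched as follows. This is only a URL builder under the assumption that `of` steps by 20 results per page, as the example URL suggests; `entrylist_pages` is a hypothetical helper and the fetching/parsing of each page is left out:

```python
from urllib.parse import urlencode

ENTRYLIST = "http://b.hatena.ne.jp/entrylist"

def entrylist_pages(host, pages, per_page=20):
    """Yield paginated Hatena Bookmark entrylist URLs for a host.
    The 'of' parameter is assumed to be a result offset stepping
    by per_page, per the example in the channel log above."""
    for i in range(pages):
        # List of tuples keeps the url= parameter first in the query string.
        yield "%s?%s" % (ENTRYLIST, urlencode([("url", host), ("of", i * per_page)]))
```

Each generated URL would then be fetched and scanned for `*.nifty.com` links, which is roughly what the project's `scrape_hatena.py` script does.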
Progress
- On 2016-09-12, User:Sanqui harvested 8884 *.nifty.com URLs from Wikimedia sites using mwlinkscrape
- On 2016-09-13, root homepages were added to this list, bringing it to 11423 URLs: https://sanqui.rustedlogic.net/etc/archiveteam/nifty_wikimedia_sites_fix.txt. ArchiveBot job ident 21z8da69732jgmp4g6pn949p4
- On 2016-09-15, Hatena bookmarks were scraped with a script (https://github.com/Sanqui/archiveteam-nifty/blob/master/scrape_hatena.py) and the results derived, producing a list of 19973 URLs: https://raw.githubusercontent.com/Sanqui/archiveteam-nifty/master/urls/hatena.txt
Next steps
- GoogleScraper is no good. Attempt scraping Bing and Twitter using the hints on Site exploration
- Scrape archive.is
- Put chunks of up to 100k URLs onto high speed (20160911.01) ArchiveBot pipelines
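The chunking step above can be sketched as a simple list split. The 100k limit comes from the note above; the helper name `chunk` is hypothetical, and feeding the resulting files to ArchiveBot pipelines is a separate manual step:

```python
def chunk(urls, size=100000):
    """Split a URL list into chunks of at most `size` entries,
    one chunk per ArchiveBot list job."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def write_chunks(urls, prefix="nifty_urls"):
    """Write each chunk to its own newline-delimited file."""
    for n, part in enumerate(chunk(urls)):
        with open("%s_%03d.txt" % (prefix, n), "w") as f:
            f.write("\n".join(part) + "\n")
```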