Difference between revisions of "Nifty"

Revision as of 21:01, 13 September 2016

Nifty
Japanese ISP with web hosting
URL	homepage.nifty.com
Status	Closing
Archiving status	Not saved yet
Archiving type	Unknown
IRC channel	#archiveteam-bs (on hackint)

Nifty is a Japanese ISP that provides web hosting. It will close about 140,000 unclaimed homepages by 2016-09-29. Termination notice (Japanese)

http://homepage.nifty.com/USERNAME/
http://homepage2.nifty.com/USERNAME/
http://homepage3.nifty.com/USERNAME/

URL harvesting

Let's follow Site exploration.

<polm> One thing I would recommend is searching Hatena Bookmarks, which is like a Japanese free Pinboard
<polm> Like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com
<polm> the "of" query parameter paginates like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com&of=20
<zout> there's some here. https://archive.is/homepage2.nifty.com
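The pagination polm describes can be sketched as below. This is only an illustration: the function name `hatena_page_urls` is made up here, and the step of 20 per page is inferred from the single `of=20` example above, not confirmed against Hatena itself. The sketch only builds the URLs and does not fetch anything.

```python
from urllib.parse import urlencode

HATENA_ENTRYLIST = "http://b.hatena.ne.jp/entrylist"


def hatena_page_urls(host, pages, step=20):
    """Build entrylist URLs for `host`, one per result page.

    `step` is an assumption: the example in the chat log jumps to of=20,
    which suggests 20 entries per page.
    """
    urls = []
    for page in range(pages):
        params = {"url": host}
        if page > 0:
            # The first page carries no offset; later pages advance `of`.
            params["of"] = page * step
        urls.append(HATENA_ENTRYLIST + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    for url in hatena_page_urls("homepage2.nifty.com", 3):
        print(url)
```

The same call can be repeated for homepage.nifty.com and homepage3.nifty.com, since all three hosts follow the same URL pattern.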

Progress

On 2016-09-12, User:Sanqui harvested 8884 *.nifty.com URLs from Wikimedia sites using mwlinkscrape
On 2016-09-13, root homepages were added to this list, making it 11423 URLs: https://sanqui.rustedlogic.net/etc/archiveteam/nifty_wikimedia_sites_fix.txt. ArchiveBot job ident 21z8da69732jgmp4g6pn949p4
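Adding root homepages to a harvested list could look roughly like the following. This is a sketch under assumptions, not the script that was actually used: the names `root_homepage` and `with_roots` are hypothetical, and it assumes the first path segment is always the USERNAME, as in the URL patterns at the top of this page.

```python
import re
from urllib.parse import urlsplit

# Matches homepage.nifty.com, homepage2.nifty.com, homepage3.nifty.com, ...
HOST_RE = re.compile(r"^homepage\d*\.nifty\.com$", re.IGNORECASE)


def root_homepage(url):
    """Return http://HOST/USERNAME/ for a Nifty homepage URL, else None.

    Assumption: the first path segment is the username. A URL with a file
    directly under the host root would be misread as a username.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not HOST_RE.match(host):
        return None
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return None
    return "http://{}/{}/".format(host, segments[0])


def with_roots(urls):
    """Return the URLs plus their root homepages, deduplicated, in order."""
    seen = set()
    out = []
    for url in urls:
        for candidate in (url, root_homepage(url)):
            if candidate and candidate not in seen:
                seen.add(candidate)
                out.append(candidate)
    return out
```

For example, `root_homepage("http://homepage2.nifty.com/foo/bar/index.html")` yields `http://homepage2.nifty.com/foo/`.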

Next steps

GoogleScraper is no good. Make attempts at scraping Bing and Twitter using hints on Site exploration
Scrape hatena
Scrape archive.is
Put chunks of up to 100k URLs onto high speed (20160911.01) ArchiveBot pipelines
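Splitting a harvested list into chunks of up to 100k URLs can be sketched as follows. The helper name `chunk_urls` is hypothetical, and how each chunk is then handed to an ArchiveBot pipeline is deliberately left out, since that depends on the pipeline operators.

```python
def chunk_urls(urls, max_size=100_000):
    """Split a list of URLs into consecutive chunks of at most `max_size`."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [urls[i:i + max_size] for i in range(0, len(urls), max_size)]


if __name__ == "__main__":
    # Placeholder URLs for illustration only, not real harvested data.
    sample = ["http://homepage2.nifty.com/user{}/".format(n) for n in range(250_000)]
    for index, chunk in enumerate(chunk_urls(sample)):
        print(index, len(chunk))
```

A list the size of the current one (11423 URLs) fits in a single chunk; the split only matters once the Hatena and archive.is scrapes are merged in.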

@@ Line 31: / Line 31: @@
 * On 2016-09-12, [[User:Sanqui]] harvested 8884 *.nifty.com URLs from Wikimedia sites using [[Site exploration#MediaWiki wikis|mwlinkscrape]]
+* On 2016-09-13, root homepages were added to this list, making it 11423 URLs: https://sanqui.rustedlogic.net/etc/archiveteam/nifty_wikimedia_sites_fix.txt.  ArchiveBot job ident <tt>21z8da69732jgmp4g6pn949p4</tt>
 Next steps
-* Make attempts at scraping Google, Bing, Twitter using hints on [[Site exploration]]
+* GoogleScraper is no good.  Make attempts at scraping Bing and Twitter using hints on [[Site exploration]]
 * Scrape hatena
 * Scrape archive.is
-* Write a script to unravel URLs (when only a subpage was linked, we want to get the homepage itself too), order strategically by some simple heuristic (Wikipedia gets priority, then high ranking sites on Google, etc.)
+* Put chunks of up to 100k URLs onto high speed (20160911.01) ArchiveBot pipelines
-* Begin feeding lists, split into reasonable chunks, into ArchiveBot after consulting with yipdw
