Difference between revisions of "Nifty"
Revision as of 16:03, 16 September 2016
Nifty
Japanese ISP with web hosting
URL: homepage.nifty.com
Status: Closing
Archiving status: In progress...
Archiving type: Unknown
Project source: https://github.com/ArchiveTeam/nifty-discovery
IRC channel: #niftyjanai (on hackint)
Nifty is a Japanese ISP providing web hosting. It will be closing about 140,000 unclaimed homepages by 2016-09-29. Termination notice (Japanese)
- http://homepage1.nifty.com/USERNAME/
- http://homepage2.nifty.com/USERNAME/
- http://homepage3.nifty.com/USERNAME/
URL harvesting
Follow the approaches described on Site exploration.
<polm> One thing I would recommend is searching Hatena Bookmarks, which is like a Japanese free Pinboard
<polm> Like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com
<polm> the "of" query parameter paginates like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com&of=20
<zout> there's some here. https://archive.is/homepage2.nifty.com
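The Hatena Bookmark pagination from the chat log above could be scripted roughly as follows. This is a hedged sketch, not the script actually used: the entrylist endpoint and the "of" offset parameter come from the transcript, the page size of 20 is inferred from the example offset, and `extract_nifty_urls` is a hypothetical helper that pulls homepageN.nifty.com links out of raw HTML with a regex.

```python
import re
import urllib.request

ENTRYLIST = "http://b.hatena.ne.jp/entrylist"
PAGE_SIZE = 20  # assumed: the example in the chat log jumps straight to of=20

def page_url(site, offset):
    """Build one paginated entrylist URL for a given site prefix."""
    return f"{ENTRYLIST}?url={site}&of={offset}"

# Hypothetical extractor: find homepageN.nifty.com links in raw HTML.
NIFTY_RE = re.compile(r"https?://homepage\d\.nifty\.com/[^\s\"'<>]+")

def extract_nifty_urls(html):
    """Return the unique nifty.com homepage URLs found in an HTML string."""
    return sorted(set(NIFTY_RE.findall(html)))

def harvest(site, pages=5):
    """Fetch the first few entrylist pages and collect nifty.com URLs."""
    found = set()
    for i in range(pages):
        with urllib.request.urlopen(page_url(site, i * PAGE_SIZE)) as resp:
            found.update(extract_nifty_urls(resp.read().decode("utf-8", "replace")))
    return sorted(found)
```

In practice the loop would keep paginating until a page yields no new URLs, and the harvest would be repeated for each homepageN.nifty.com prefix.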
Progress
- On 2016-09-12, User:Sanqui harvested 8884 *.nifty.com URLs from Wikimedia sites using mwlinkscrape
- On 2016-09-13, root homepages were added to this list, making it 11423 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/wikimedia.txt. ArchiveBot job ident 21z8da69732jgmp4g6pn949p4
- On 2016-09-15, Hatena bookmarks were scraped with a script and derived, producing a list of 19973 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/hatena.txt. ArchiveBot job ident 3i04vcsil92hl80yxbxiimncn
- On 2016-09-16, archive.is pages were scraped with a script, then derived and deduplicated, producing a list of a mere 1165 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/archiveis.txt. ArchiveBot job ident 2bkvkya714zxqkity2cmw1w10
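The "derive" step used in the lists above, reducing arbitrary harvested URLs to their root homepage (http://homepageN.nifty.com/USERNAME/) and dropping duplicates, might look like this. The function names and regex are illustrative; they are not taken from the nifty-discovery scripts.

```python
import re

# Match the host and the first path segment (the homepage owner's name).
ROOT_RE = re.compile(r"^https?://(homepage\d\.nifty\.com)/([^/]+)/?")

def derive_root(url):
    """Reduce a harvested URL to its root homepage, or None if it doesn't match."""
    m = ROOT_RE.match(url)
    if not m:
        return None
    return f"http://{m.group(1)}/{m.group(2)}/"

def derive_list(urls):
    """Derive root homepages from a URL list, deduplicated and sorted."""
    roots = {derive_root(u) for u in urls}
    roots.discard(None)  # drop URLs that didn't match the homepage pattern
    return sorted(roots)
```

Normalizing https to http and adding the trailing slash means deep links, bookmark variants, and archive copies of the same homepage all collapse to one entry.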
Next steps
- GoogleScraper is no good. Attempt scraping Bing and Twitter using the hints on Site exploration
- Scrape http://e-shuushuu.net/ (DoomTay)
- Put chunks of up to 100k URLs onto high speed (20160911.01) ArchiveBot pipelines
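Splitting a combined URL list into chunks of up to 100k lines for ArchiveBot can be done with coreutils. A minimal sketch; `urls.txt` and the `chunk_` prefix are placeholder names, and the demo input is generated just to show the chunking:

```shell
# Demo input: 250,000 fake homepage URLs (placeholder file name).
seq 1 250000 | sed 's|^|http://homepage1.nifty.com/user|' > urls.txt

# Split into numbered chunks of up to 100k lines: chunk_00, chunk_01, chunk_02.
split -l 100000 -d urls.txt chunk_
```

Each resulting chunk can then be queued as a separate ArchiveBot list job.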