Revision as of 16:11, 15 September 2016
Nifty
Japanese ISP with web hosting
URL: homepage.nifty.com
Status: Closing
Archiving status: Not saved yet
Archiving type: Unknown
IRC channel: #archiveteam-bs (on hackint)
Nifty is a Japanese ISP providing web hosting. It will be closing about 140,000 unclaimed homepages by 2016-09-29. Termination notice (Japanese)
http://homepage1.nifty.com/USERNAME/
http://homepage2.nifty.com/USERNAME/
http://homepage3.nifty.com/USERNAME/
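For illustration, a minimal sketch of normalizing harvested links to these root homepage URLs. The hostname pattern follows the three hosts listed above; the helper name `root_homepage` is hypothetical, not part of any existing tooling:

```python
import re

# Matches the three known hosting hostnames (homepage1/2/3.nifty.com)
# and captures the username component of the path.
HOMEPAGE_RE = re.compile(
    r"^https?://homepage[123]\.nifty\.com/([^/?#]+)", re.IGNORECASE
)

def root_homepage(url):
    """Normalize any page under a user's site to its root homepage URL,
    or return None if the URL is not a nifty homepage."""
    m = HOMEPAGE_RE.match(url)
    if not m:
        return None
    host = url.split("/")[2].lower()
    return "http://%s/%s/" % (host, m.group(1))
```

Running harvested URLs through a normalizer like this yields the deduplicated root-homepage list that a crawl can start from.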
URL harvesting
Let's follow Site exploration.
<polm> One thing I would recommend is searching Hatena Bookmarks, which is like a Japanese free Pinboard
<polm> Like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com
<polm> the "of" query parameter paginates like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com&of=20
<zout> there's some here: https://archive.is/homepage2.nifty.com
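The `of` pagination described above can be sketched as follows. This is only a URL builder under the assumption that `of` steps by 20 results per page, as the example URL suggests; `entrylist_pages` is a hypothetical helper and the fetching/parsing of each page is left out:

```python
from urllib.parse import urlencode

ENTRYLIST = "http://b.hatena.ne.jp/entrylist"

def entrylist_pages(host, pages, per_page=20):
    """Yield paginated Hatena Bookmark entrylist URLs for a host.
    The 'of' parameter is assumed to be a result offset stepping
    by per_page, per the example in the channel log above."""
    for i in range(pages):
        # List of tuples keeps the url= parameter first in the query string.
        yield "%s?%s" % (ENTRYLIST, urlencode([("url", host), ("of", i * per_page)]))
```

Each generated URL would then be fetched and scanned for `*.nifty.com` links, which is roughly what the project's `scrape_hatena.py` script does.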
Progress
- On 2016-09-12, User:Sanqui harvested 8884 *.nifty.com URLs from Wikimedia sites using mwlinkscrape
- On 2016-09-13, root homepages were added to this list, bringing it to 11423 URLs: https://sanqui.rustedlogic.net/etc/archiveteam/nifty_wikimedia_sites_fix.txt. ArchiveBot job ident 21z8da69732jgmp4g6pn949p4
- On 2016-09-15, Hatena bookmarks were scraped with a script (https://github.com/Sanqui/archiveteam-nifty/blob/master/scrape_hatena.py) and the results derived, producing a list of 19973 URLs: https://raw.githubusercontent.com/Sanqui/archiveteam-nifty/master/urls/hatena.txt
Next steps
- GoogleScraper is no good. Attempt scraping Bing and Twitter using the hints on Site exploration
- Scrape archive.is
- Put chunks of up to 100k URLs onto high speed (20160911.01) ArchiveBot pipelines
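The chunking step above can be sketched as a simple list split. The 100k limit comes from the note above; the helper name `chunk` is hypothetical, and feeding the resulting files to ArchiveBot pipelines is a separate manual step:

```python
def chunk(urls, size=100000):
    """Split a URL list into chunks of at most `size` entries,
    one chunk per ArchiveBot list job."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def write_chunks(urls, prefix="nifty_urls"):
    """Write each chunk to its own newline-delimited file."""
    for n, part in enumerate(chunk(urls)):
        with open("%s_%03d.txt" % (prefix, n), "w") as f:
            f.write("\n".join(part) + "\n")
```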