Deathwatch/Misc

:''Main article: [[Deathwatch]]''


== Common knowledge and infrastructure ==


* [https://encrypted.google.com/search?&q=inurl%3Arobots.txt+filetype%3Atxt+%2B%22ia_archiver%22 Sites that block the Wayback Machine] are at risk of being completely lost if they ever shut down; a quick way to check an individual site is sketched below.
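Since the Wayback Machine's crawler identifies itself as "ia_archiver", one rough way to check any single site is to read its robots.txt. A minimal sketch, Python standard library only; the site list is a placeholder:

<syntaxhighlight lang="python">
# Check whether a site's robots.txt blocks the Wayback Machine's crawler,
# which identifies itself as "ia_archiver". Standard library only.
from urllib.robotparser import RobotFileParser

def blocks_wayback(site):
    """Return True if robots.txt forbids ia_archiver from fetching the root page."""
    rp = RobotFileParser()
    rp.set_url("http://" + site + "/robots.txt")
    rp.read()                          # downloads and parses robots.txt
    return not rp.can_fetch("ia_archiver", "http://" + site + "/")

for site in ["example.com"]:           # placeholder list of sites to check
    print(site, "blocks ia_archiver" if blocks_wayback(site) else "allows ia_archiver")
</syntaxhighlight>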


* Archive the shutdown announcement pages on dead sites.
** This is being done on each wiki page: paste the shutdown announcement and, when possible, archive it at WebCite. [[User:Emijrp|Emijrp]] 19:33, 4 June 2011 (UTC)


* RSS Feed with death notices (a small feed-generation sketch follows this list). - [[User:Jscott|Jason]]
** I'm taking a shot at this with [http://www.deaddyingdamned.com The Dead, the Dying & the Damned]. --[[User:Auguste|Auguste]] 14:34, 4 March 2011 (UTC)
* The ArchiveTeam twitter might be a good way to broadcast new site obituaries. - psicom
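A death-notice feed would not need anything fancy. A minimal sketch of generating RSS 2.0 with the Python standard library; the notice entries and output filename are placeholders, not an existing feed:

<syntaxhighlight lang="python">
# Minimal RSS 2.0 "death notices" feed built with the standard library.
# The entries below are placeholders; a real feed would be generated from
# the Deathwatch listings themselves.
import xml.etree.ElementTree as ET
from email.utils import formatdate

notices = [
    {"title": "Example service shutting down on 2011-08-01",
     "link": "http://example.com/shutdown-announcement",
     "description": "Shutdown announced; grab what you can."},
]

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "ArchiveTeam Deathwatch"
ET.SubElement(channel, "link").text = "http://archiveteam.org/"
ET.SubElement(channel, "description").text = "Sites that are dying or dead"

for notice in notices:
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = notice["title"]
    ET.SubElement(item, "link").text = notice["link"]
    ET.SubElement(item, "description").text = notice["description"]
    ET.SubElement(item, "pubDate").text = formatdate()  # RFC 2822 timestamp

ET.ElementTree(rss).write("deathwatch.xml", encoding="utf-8", xml_declaration=True)
</syntaxhighlight>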


== General categories to look for ==
* The Swedish photo diary community Dayviews contains quite a bit of 00s Swedish "youth culture", but is at risk of being shut down when you least expect it due to its declining user numbers.
* The European Pokémon Diamond and Pearl promotional website is still up but has rotted away in some parts. It would likely be a good idea to archive it at some point before it all dissolves into the ether. http://pokemondiamondandpearl.nintendo-europe.com/
* [http://www.bbc.co.uk/h2g2/ h2g2] - "H2G2 is a constantly expanding, user-generated guide to life, the universe and everything. The site was founded in 1999 by Hitchhiker's Guide to the Galaxy author Douglas Adams." There are plans to buy h2g2 from the BBC (http://www.bbc.co.uk/dna/h2g2/brunel/A80173361).
* Various Image Boards - not the short-lived 4chan clones but the more permanent ones like www.zerochan.net (as of today it has over 1.6 million images, all easily available like this: www.zerochan.net/1627488), Pixiv.net, minitokyo.net
* Suggestion: An archive of .gif and .swf preloaders? [[User:Kuro|Kuro]] 19:49, 29 December 2009 (UTC)
**We can extract all the .gif files from the GeoCities archive and compare them using md5sum to discard dupes (see the sketch after this list). [[User:Emijrp|Emijrp]] 19:58, 21 December 2010 (UTC)
* '''Set up''' an FTP hub which AT members can use to upload and download finished projects.
** Internet Archive? jason created a section for Archive Team http://www.archive.org/details/archiveteam [[User:Emijrp|Emijrp]] 19:34, 4 June 2011 (UTC)
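A minimal sketch of the md5 de-duplication idea above, assuming the GeoCities archive has been unpacked to a local <code>geocities/</code> directory (a placeholder path):

<syntaxhighlight lang="python">
# Walk an extracted GeoCities dump, hash every .gif, and keep only one copy
# per unique hash. "geocities/" is a placeholder for the unpacked archive.
import hashlib
import os

seen = {}          # md5 hex digest -> first path seen with that digest
duplicates = []    # later paths whose content repeats an earlier file

for root, _dirs, files in os.walk("geocities"):
    for name in files:
        if not name.lower().endswith(".gif"):
            continue
        path = os.path.join(root, name)
        with open(path, "rb") as fh:
            digest = hashlib.md5(fh.read()).hexdigest()
        if digest in seen:
            duplicates.append(path)
        else:
            seen[digest] = path

print(len(seen), "unique .gif files,", len(duplicates), "duplicates")
</syntaxhighlight>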


* Track the 100+ top [[twitter]] feeds, as designated by one of these idiot Twitter grading sites, and back up on a regular basis the top twitter people, for posterity.
* Archive as many file servers (FTP and HTTP) as possible (a small ftplib mirroring sketch follows this list).
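For plain anonymous FTP servers, a rough mirroring loop can be built from the standard library alone. A minimal sketch; the hostname is a placeholder, and a real run would need politeness delays, error logging, protection against symlink loops, and ideally WARC output instead of bare files:

<syntaxhighlight lang="python">
# Recursively mirror an anonymous FTP server with ftplib (standard library).
import os
from ftplib import FTP, error_perm

def mirror(ftp, remote_dir, local_dir):
    """Recursively download remote_dir into local_dir."""
    os.makedirs(local_dir, exist_ok=True)
    ftp.cwd(remote_dir)
    for name in ftp.nlst():
        if name in (".", ".."):
            continue
        remote_path = remote_dir.rstrip("/") + "/" + name
        try:
            ftp.cwd(remote_path)                # succeeds only for directories
            mirror(ftp, remote_path, os.path.join(local_dir, name))
            ftp.cwd(remote_dir)                 # return to the directory we were listing
        except error_perm:                      # not a directory: download it
            with open(os.path.join(local_dir, name), "wb") as fh:
                ftp.retrbinary("RETR " + remote_path, fh.write)

ftp = FTP("ftp.example.com")                    # placeholder host
ftp.login()                                     # anonymous login
mirror(ftp, "/", "mirror")
ftp.quit()
</syntaxhighlight>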


* [[TinyURL]] and similar services, scraping/backup - [[User:scumola|Steve]]
** Highlight services that at least allow exporting data ([[Diigo]] is the one I know of). Next "best": services that require registration and let you view your URLs and save them, e.g. by saving the page as HTML ([[tr.im]]). Etc. --[[User:Jaakkoh|Jaakkoh]] 05:39, 4 April 2009 (UTC)
** see [[urlteam]]; a minimal short-code enumeration sketch follows this list. [[User:Emijrp|Emijrp]] 19:33, 4 June 2011 (UTC)
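The usual urlteam-style approach is to walk the shortener's code space and record where each code redirects, without following the redirect. A minimal sketch assuming a tinyurl.com-style layout; every service has its own code alphabet and rate limits, so a real scraper needs throttling and resume support:

<syntaxhighlight lang="python">
# Enumerate short codes and record each one's redirect target (Location header).
import itertools
import string
import urllib.error
import urllib.request

ALPHABET = string.ascii_lowercase + string.digits

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None                    # stop urllib from following the redirect

opener = urllib.request.build_opener(NoRedirect)

def resolve(code):
    """Return the redirect target for one short code, or None."""
    try:
        opener.open("https://tinyurl.com/" + code, timeout=10)
    except urllib.error.HTTPError as err:      # the 301/302 surfaces here
        return err.headers.get("Location")
    except urllib.error.URLError:
        return None
    return None

for length in (1, 2):                          # tiny demonstration range
    for combo in itertools.product(ALPHABET, repeat=length):
        code = "".join(combo)
        target = resolve(code)
        if target:
            print(code, target)
</syntaxhighlight>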


* '''[http://symphony21.com/ Symphony]''' could [http://nick-dunn.co.uk/article/symphony-as-a-data-preservation-utility/ potentially be used] for archiving structured XML/RSS feeds to a relational database - [[User:nickdunn|Nick]]
* [[Yahoo!]] makes a habit of shutting things down. Keep an eye on its [https://en.wikipedia.org/wiki/List_of_Yahoo!-owned_sites_and_services services], [https://en.wikipedia.org/wiki/List_of_mergers_and_acquisitions_by_Yahoo! acquisitions], and [https://en.wikipedia.org/wiki/Yahoo_(2017%E2%80%93present)#Brands divestments].
* '''A Firefox plugin''' for redirecting users to our archive when they request a site that's been rescued. - ???
 
**Good idea; the problem is that the archives are not hosted in their original form, but are packed. [[User:Emijrp|Emijrp]] 19:32, 4 June 2011 (UTC)
**Something like what you propose already exists: [[wikipedia:MafiaaFire Redirector|MAFIAAFire Redirector]] (though it only redirects links from domains seized by governments to backup sites), so anyone who wants to take on this project could start by reviewing how that extension works. The catch is that our archives are not hosted on a server in their original form; everything is packed. I have read that [[wikipedia:Heritrix|Heritrix]] (the Internet Archive's web crawler) stores the resources it crawls in [[wikipedia:.arc|ARC]] files by default, so perhaps something similar could be done, using bzip2, 7z, or rar archives, or a combination of them, to manage a site's resources. --[[User:Swicher|Swicher]] 07:23, 27 July 2011 (UTC)
 
* Electronics datasheets: [http://alldatasheet.com this], [http://datasheetarchive.com this], [http://www.datasheetcatalog.com this] [http://www.htmldatasheet.com and this] for example. Many of these datasheets are already very hard to find (esp. for older and rarer parts, e.g. those required to emulate old computer systems) and the sites are often in China, Russia or other countries that might give problems in the future. Lots of data to grab, and many of these sites only have very slow bandwidth, so it might be good to start archiving them early. --[[User:Darkstar|Darkstar]] 23:47, 9 April 2011 (UTC)
 
* Archives of MUD, MUSH, MOO game sites and related information. They won't all be around forever. --[[User:Auguste|Auguste]] 13:59, 24 February 2011 (UTC)
** I'm keeping an eye out for, and archiving sites like [http://www.lambdamoo.info LambdaMOO.info], which are either closing down or may be at risk. --[[User:Auguste|Auguste]] 13:59, 24 February 2011 (UTC)
* [http://ytmnd.com YTMND] [[User:Zachera|Zachera]] 20:06, 25 March 2011 (UTC)
 
* User-created content in video games is always in danger of being lost forever. This includes:
** Over 100,000 levels made in games like the Little Big Planet series or Super Mario Maker series.
** Custom golf courses made in The Golf Club series.
** Various banners, spray paints, weapon skins, and insignias from competitive online shooters.
** Maps in multiple Halo games have been modified and shared with the Forge game mode.
** Specialized tracks built in car racing games.
** Almost everything made in the PlayStation 4 game Dreams.
** Decal items (uploaded in SVG format) for PlayStation 4 racer ''[[wikipedia:Gran Turismo Sport|Gran Turismo Sport]]''.
** Every game, group, and nearly every catalog item in [[Roblox]].
** Over 191,000,000 creations made in [[Spore]] and published on the official game site.
 
* BitTorrent DHT - indexed by various projects which tend to get shut down sooner or later (a small sketch for turning dumped info-hashes into magnet links and cached torrents follows this list):
** [https://btdb.to BTDB] - has had its domain seized in the past, [http://btdb.to/about uses gmail for DMCA notices], no other contact info
** [https://btdig.com/ BTDigg] - has had multiple domains seized in the past and has died before. Its Twitter and Facebook accounts are inactive. The contact page doesn't work, but it allegedly uses an email form
** [https://torrentproject.se/ Torrent Project] - maybe dead, see [[Deathwatch#2017]] for more information
** [https://itorrents.org iTorrents.org] - torrent cache, run by the operator of limetorrents.cc (see whois for contact information)
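If a DHT indexer publishes its info-hashes before it disappears, they can at least be turned into magnet links, and cached <code>.torrent</code> files can be tried from a torrent cache. A minimal sketch; the itorrents.org URL pattern used here is an assumption based on the "torrent cache" note above and should be verified before relying on it:

<syntaxhighlight lang="python">
# Turn info-hashes into magnet links and try to fetch cached .torrent files.
import urllib.request

def magnet(infohash):
    return "magnet:?xt=urn:btih:" + infohash.lower()

def fetch_cached_torrent(infohash, out_path):
    # Assumed cache URL pattern; confirm against the live service first.
    url = "https://itorrents.org/torrent/" + infohash.upper() + ".torrent"
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = resp.read()
    except OSError:                      # covers HTTP and network errors
        return False
    with open(out_path, "wb") as fh:
        fh.write(data)
    return True

infohashes = ["0123456789abcdef0123456789abcdef01234567"]   # placeholder hash
for h in infohashes:
    print(magnet(h))
    fetch_cached_torrent(h, h + ".torrent")
</syntaxhighlight>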
 
== Specific sites to watch ==
 
* [[Google]] has {{url|1=http://www.seopedia.org/internet-marketing-and-seo/googles-secret-andor-forgotten-places/|2=quite}} {{url|1=http://www.seopedia.org/seo-news/google-2/googles-56-forgotten-secret-pages-part-two/|2=a few}} old pages on their servers which haven't been updated in a long time. Might be a good idea to save these before they disappear.
 
* Like Google, Nintendo of Japan has its share of ancient pages, like {{url|1=http://www.nintendo.co.jp/n02/dmg/mla/index.html|2=this one}}.
 
* SMWStuff.com died due to technical difficulties. Much user-created data that had not been backed up was lost. The data that survived went onto [smwstuff.net], which also contains new uploads. It would be good to back it up, considering the last blog post was in 2017. It seems to rely on javascript (damn). ''UPDATE:'' All thumbnails and downloads have been archived with ArchiveBot by TheTechRobo and systwi. {{Job|8juzv}}. TheTechRobo has archived the API requests and responses, comments, and metadata into WARC files, but those probably won't go into the WBM. The developer of the website is still doing stuff, but [http://72dpiarmy.supersanctuary.net/index.php?topic=11124.msg202691#msg202691 says the project is on hiatus].
 
* JoshW's video game music archive (links on http://hcs64.com/mboard/forum.php?showthread=26929). Not a "large" site but many many gigs of 7zipped WAVs
 
* [http://www.groklaw.net/ Groklaw] has a [http://www.groklaw.net/article.php?story=20090105033126835 project proposal] that we could help with. - [[User:Jscott|Jason]]
** Now that Groklaw is dead, a mirror ought to be made soon. (Especially because their [http://groklaw.net/robots.txt robots.txt] blocks the Wayback Machine.) --[[User:Mithrandir|Mithrandir]] 20:28, 21 August 2013 (EDT)
 
* [http://c2.com/cgi/wiki?WikiWikiWeb WikiWikiWeb] - The first wiki; still a valuable source of information on programming patterns and related topics. It's still active, but I'm not sure how much. It's been going since 1995, so it's got real historical value. Plus it's all text and wouldn't take much space. The owner, Ward Cunningham, might be amenable to providing a copy, so I'd suggest making contact first.
** I've done this and linked the dump from [[WikiTeam]]. -- [[User:Ca7|Ca7]]
 
* ElfQuest Comics. They've recently all been scanned (6500 pages+) and are available [http://www.elfquest.com/gallery/OnlineComics3.html here]. They're hidden behind a Flash-based viewer though so someone would first have to decompile that to get to the links. --[[User:Darkstar|Darkstar]] 20:55, 18 May 2011 (UTC)
**Working on getting this finished up, done downloading all the images, just have to package it up. [[User:Underscor|Underscor]] 22:35, 4 June 2011 (UTC)
* '''TechNet Archive''': [http://www.microsoft.com/technet/archive/default.mspx?mfr=true here] "Technical information about older versions of Microsoft products and technologies. This information is scheduled to be removed soon." --[[User:Marceloantonio1|Marceloantonio1]] 08:24, 9 June 2011 (UTC -3)
** They've since switched to an HTML viewer, so archiving it in the future should be much easier --[[User:DoomTay|DoomTay]] ([[User talk:DoomTay|talk]]) 14:52, 30 June 2017 (EDT)
 
**TechNet, and its big cousin, MSDN, are already being archived by other sites. For example, {{url|1=http://betaarchive.com}} has archived a huge pile of them, including older ones from the late 90's.
* [[Jux]] was going to get jammed on August 31, 2013, but not anymore. Still might be a good idea to keep them on the radar.
* [[Google Answers]] has not been accepting new questions for a while, and whether it will remain up for much longer is debatable.
* [[Yahoo!]] has decided to shut down more services, including [[Yahoo! Stars India]], [[Yahoo! Neighbors]], etc. These should be archived before they shut down. Also, yodel.yahoo.com seems to have been replaced by yahoo.tumblr.com, and should be archived too.
 
* Newgrounds is one of the largest collections of Flash games and movies on the Internet. It would be a shame if it all disappeared.
 
* Archive every [http://www.google.com/doodles/ Google Doodle].
* Save all the [http://www.emergencyalert.alberta.ca Alberta Emergency Alerts].


* http://atheistpictures.com/


* [http://www.harmonycentral.com Harmony Central] user-submitted reviews were around for over a decade and covered just about every musical instrument and related accessory commercially sold. Site updates have taken these offline, though admins say the data still exists. As far as can be determined, Archive.org has little if any of these reviews. [http://www.harmonycentral.com/t5/Feedback/User-reviews/td-p/34660122 This thread] has the whole story. --[[User:Benbradley|Benbradley]] 20:41, 13 July 2013 (EDT)
* <s>[http://strawpoll.me/ Strawpoll] is a very simple poll site which looks like somebody's weekend project, but I've seen it used a lot in the speedrunning community. Very simple indexed structure - http://strawpoll.me/0 to http://strawpoll.me/2317429 at time of editing. Would be nice to have a backup in case it disappeared one day. Could very well be a one-person project. --[[User:Sanqui|Sanqui]] 10:20, 11 August 2014 (EDT)</s> Dead. See [[Strawpoll.me]] for more info.


* 20 newspapers in Quebec will shut down in the coming weeks. Here's a list [http://pastebin.com/Xwt19JFQ] of those still up that need to be archived ASAP.
* Rue Frontenac was a website created during a newspaper lockout in Canada back in 2009. It was saved here, but I'm not sure if anybody is maintaining it. Copy?
* LEGO has a bad habit of deleting Flash games and other materials from their sites. Some of them still lie in pieces on cache.lego.com, awaiting their deletion. Fortunately, some games are still available to play on BioMediaProject or 4T2 Portfolio.
* These sites are getting an update in the next few months:
** Lincs FM, Trax FM, Rutland Radio [www rutlandradio.co.uk - spam filter on here blocked this url], Dearne FM, Rother FM, Compass FM, KCFM 99.8, Ridings FM. All are getting an update, so you might want to back these up; not sure what the best means are, but making a mirror of Lincs FM Group websites is good for historical reasons.


* WordChamp was supposed to have shut down on June 30, 2013, later changed to September 15, 2013, but is still up and running.
* Hewlett-Packard removes any documentation for products that reach their end of life, usually when said product is 10 years old. https://support.hp.com/us-en/retired-products --[[User:DoomTay|DoomTay]] ([[User talk:DoomTay|talk]]) 14:52, 30 June 2017 (EDT)
* Louis Rossmann runs a business repairing Apple MacBook motherboards and makes videos about these repairs, business advice, and related philosophy. Apple considers the repairs "unauthorized", and he has been threatened with shutdowns in the past.
** He has a [[YouTube]] [https://www.youtube.com/channel/UCl2mFZoRqjw_ELax4Yisf6w channel] and [https://www.rossmanngroup.com/ some] [https://www.rossmanngroup.com/boards business] [https://mailin.repair/ websites].
* Enderman's YouTube channel has had videos taken down left and right. He makes videos about reverse-engineering Windows.[https://www.youtube.com/watch?v=ssu8Mv7hSdc]
* 8chan disappeared from the clearnet in early August 2019, returned to clearnet in October 2019 rebranded as 8kun, and seems to have disappeared again.
* [[SteamGridDB]] is unable to pay for its servers, and will shut down on 31st March 2020 unless there is some sort of miraculous Patreon/sponsor intervention.
** [https://blog.steamgriddb.com/2020-year-in-review#patreon Appears to be surviving on Patreon income.]
* [https://racing-reference.info Racing-Reference] (motorsports/NASCAR stats website, owned by NASCAR itself since 2017) discontinued [https://www.reddit.com/r/NASCAR/comments/l0r1m9/as_of_january_13_2021_racingreference_has/ user blogs in January 2021] and the [https://twitter.com/racingreference/status/1387226679379177478 comments section in April]. These are still accessible for now by those who know the URL format, but they might be completely gone anytime soon (some also fear the same could happen to non-NASCAR-sanctioned content, such as Formula One and Formula E material).
== Misc ideas ==
* Not sure if this goes here, but I have an idea for a program that would make it easier to find the links belonging to a given site. What do I mean? In my experience with the [[Windows Live Spaces]] archiving effort (and other projects I have only looked at), a recurring problem is finding the links to the content that is going to be archived; for example, a Windows Live Space lived at whatever.spaces.live.com and a Google Video at video.google.com/videoplay?docid=-[video ID number], so the question is: where do I find the links to the pages, videos, articles or anything else of a site X, so that their contents can be archived? The most obvious answer is to use one or more search engine APIs, but the [http://code.google.com/apis/ajaxsearch/documentation/reference.html Google Web Search API] is deprecated (besides being very limited), Yahoo's [http://developer.yahoo.com/search/siteexplorer/ Site Explorer API] apparently stops working on Sept. 15, and [http://msdn.microsoft.com/en-us/library/dd251020.aspx Bing's API] requires a registered AppId (I have not checked other engines, but these are the most used). Since the search engine APIs come with these problems, I think a good solution is to [http://www.google.com/search?q=%28automating|automatic|automation|automatization%29+web+%28browsing|browser%29 automate a web browser]: run the required search in (almost) any search engine, walk through all the results, and store the links somewhere. Some may wonder why to use browser automation when the same thing can be done programmatically by sending [[wikipedia:Hypertext Transfer Protocol#Request message|HTTP requests]] to the server and parsing the HTML results. True, it can be done that way, but there is a "small" problem: engines like Google and Bing serve results pages that are a mishmash of HTML and JavaScript code that is hard to analyze from the raw source. Browser automation sidesteps this, because the browser has already interpreted the code received from the server and holds an ordinary rendered page in memory, ready to be parsed. To illustrate this better I leave an example:
:[[File:Behavior of a dynamic page.PNG|thumb|left|Click the picture for a detailed description of the four screenshots it contains (and to view the image at full resolution)]]
:Doing it this way also helps with maintainability and adaptability: with browser automation, all you have to specify is the search engine's results page, the search term (something like site:whatever.com, inurl:.whatever.com/ and so on), the tag that holds the result links, and which element is the "Next" button, which cuts development and implementation time for each particular search engine without writing much code. If anyone is still interested after that long explanation: of the browser automation tools I have read about, two caught my attention - [http://watir.com/ Watir] (written in Ruby, but cross-platform and multi-browser) and [http://seleniumhq.org/projects/remote-control/ Selenium Remote Control] (also cross-platform and multi-browser, but its API supports C#, Java, Perl, PHP, Python and Ruby) - so anyone who wants to take this project on can start with one of these (or something similar). A minimal Selenium sketch follows. --[[User:Swicher|Swicher]] 09:41, 1 August 2011 (UTC)
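A minimal sketch of the idea using Selenium WebDriver with Firefox (the present-day successor of Selenium Remote Control). The search engine URL, CSS selectors, and query below are assumptions that have to be adapted per engine; the point is only the shape of the loop: load a results page, harvest links from the rendered DOM, click "next", repeat.

<syntaxhighlight lang="python">
# Collect result links from a search engine by driving a real browser.
# Requires the selenium package and geckodriver on PATH; selectors are assumed.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def collect_links(query, pages=3):
    driver = webdriver.Firefox()
    found = set()
    try:
        driver.get("https://www.bing.com/search?q=" + query)   # assumed engine
        for _ in range(pages):
            # The browser has already executed the page's JavaScript, so we
            # only read the rendered DOM instead of parsing raw HTML.
            for a in driver.find_elements(By.CSS_SELECTOR, "li.b_algo h2 a"):  # assumed result selector
                href = a.get_attribute("href")
                if href:
                    found.add(href)
            try:
                driver.find_element(By.CSS_SELECTOR, "a.sb_pagN").click()      # assumed "next" selector
            except NoSuchElementException:
                break
    finally:
        driver.quit()
    return found

if __name__ == "__main__":
    for url in sorted(collect_links("site:example.com")):       # placeholder query
        print(url)
</syntaxhighlight>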
