https://wiki.archiveteam.org/api.php?action=feedcontributions&user=Yipdw&feedformat=atom
Archiveteam - User contributions [en]
2024-03-28T09:40:40Z
User contributions
MediaWiki 1.37.1
https://wiki.archiveteam.org/index.php?title=Template:Navigation_box&diff=27445
Template:Navigation box
2017-01-16T15:56:14Z
<p>Yipdw: Undo revision 26845 by Megalanya0 (talk)</p>
<hr />
<div><br clear="all" /><center><!--<br />
<br />
<br />
<br />
<br />
Rows are in alphabetical order, except "Current events" at the top and "About Archive Team" at the bottom.<br />
Items inside rows are in alphabetical order too.<br />
Easy : )<br />
<br />
<br />
<br />
<br />
--><br />
{| class="mw-collapsible mw-collapsed" style="border: 1px solid #aaa; background-color: #f9f9f9; color: black; margin: 0.5em 0 0.5em 1em; padding: 0.2em; font-size: 100%;"<br />
| colspan=3 align=center style="background: #ccccff;" | <span style="float: right;"><span class="plainlinks">[[{{fullurl:Template:Navigation_box}} view]]&nbsp;&nbsp;[[{{fullurl:Template:Navigation_box|action=edit}} edit]]</span>&nbsp;</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'''[[Archive Team]]'''&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Archiveteam:Current events|Current events]]''' || [[Alive... OR ARE THEY]] {{·}} [[Deathwatch]] {{·}} [[Projects]] || rowspan=5 | [[File:Archiveteam.jpg|right|150px]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Archiving projects]]''' || [[APKMirror]] {{·}} [[Archive.is]] {{·}} [[BetaArchive]] {{·}} [[Government Backup]] {{·}} [[Gmane]] {{·}} [[Internet Archive]] {{·}} [[It Died]] {{·}} [[Megalodon.jp]] {{·}} [[OldApps.com]] {{·}} [[OldVersion.com]] {{·}} [[OSBetaArchive]] {{·}} [[TEXTFILES.COM]] {{·}} [[The Dead, the Dying & The Damned]] {{·}} [[The Mail Archive]] {{·}} [[UK Web Archive]] {{·}} [[WebCite]] {{·}} [[Vaporwave.me]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Blogging''' || [[Blog.pl]] {{·}} [[Blogger]] {{·}} [[Blogster]] {{·}} [[Blogter.hu]] {{·}} [[Freeblog.hu]] {{·}} [[Fuelmyblog]] {{·}} [[Jux]] {{·}} [[LiveJournal]] {{·}} [[My Opera]] {{·}} [[Nolblog.hu]] {{·}} [[Open Diary]] {{·}} [[ownlog.com]] {{·}} [[Posterous]] {{·}} [[Powerblogs]] {{·}} [[Proust]] {{·}} [[Roon]] {{·}} [[Splinder]] {{·}} [[Tumblr]] {{·}} [[Vox]] {{·}} [[Weblog.nl]] {{·}} [[Windows Live Spaces]] {{·}} [[Wordpress.com]] {{·}} [[Xanga]] {{·}} [[Yahoo! Blog]] {{·}} [[Zapd]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Clown hosting|Cloud hosting]]/file sharing''' || [[ADrive|aDrive]] {{·}} [[AnyHub]] {{·}} [[Box]] {{·}} [[Dropbox]] {{·}} [[Docstoc]] {{·}} [[Google Drive]] {{·}} [[Google Groups Files]] {{·}} [[iCloud]] {{·}} [[Fileplanet]] {{·}} [[LayerVault]] {{·}} [[MediaCrush]] {{·}} [[MediaFire]] {{·}} [[Mega]] {{·}} [[MegaUpload]] {{·}} [[MobileMe]] {{·}} [[OneDrive]] {{·}} [[Pomf.se]] {{·}} [[RapidShare]] {{·}} [[Ubuntu One]] {{·}} [[Yahoo! Briefcase]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[:Category:Corporations|Corporations]]''' || [[Apple]] {{·}} [[IBM]] {{·}} [[Google]] {{·}} [[Lycos Europe]] {{·}} [[Microsoft]] {{·}} [[Yahoo!]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Events''' || [[Arab Spring]] {{·}} [[Great Ape-Snake War]] {{·}} [[Spanish Revolution]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Font Repos''' || [[Google Web Fonts]] {{·}} [[GNU FreeFont]] {{·}} [[Fontspace]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Forums/Message boards''' || colspan=2 | [[4chan]] {{·}} [[Captain Luffy Forums]] {{·}} [[College Confidential]] {{·}} [[DSLReports]] {{·}} [[ESPN Forums]] {{·}} [[forums.starwars.com]] {{·}} [[HeavenGames]] {{·}} [[Invisionfree]] {{·}} [[The Classic Horror Film Board]] {{·}} [[Yahoo! Messages]] {{·}} [[Yahoo! Neighbors]] {{·}} [[Yuku.com]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Games|Gaming]]''' || colspan=2 | [[Atomicgamer]] {{·}} [[City of Heroes]] {{·}} [[Club Nintendo]] {{·}} [[CSGO Lounge|CS:GO Lounge]] {{·}} [[Desura]] {{·}} [[Dota 2 Lounge]] {{·}} [[Emulation Zone]] {{·}} [[GameMaker Sandbox]] {{·}} [[GameTrailers]] {{·}} [[Halo]] {{·}} [[HLTV.org]] {{·}} [[Infinite Crisis]] {{·}} [[Minecraft.net]] {{·}} [[Player.me]] {{·}} [[Playfire]] {{·}} [[Steam]] {{·}} [[SteamDB]] {{·}} [[TF2 Outpost]] {{·}} [[Warhammer]] {{·}} [[Xfire]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Image hosting]]''' || [[500px]] {{·}} [[AOL Pictures]] {{·}} [[Blipfoto]] {{·}} [[Blingee]] {{·}} [[Canv.as]] {{·}} [[Camera+]] {{·}} [[Cameroid]] {{·}} [[DailyBooth]] {{·}} [[Degree Confluence Project]] {{·}} [[deviantART]] {{·}} [[Demotivalo.net]] {{·}} [[Flickr]] {{·}} [[Fotoalbum.hu]] {{·}} [[Fotolog.com]] {{·}} [[Fotopedia]] {{·}} [[Frontback]] {{·}} [[Geograph Britain and Ireland]] {{·}} [[GTF Képhost]] {{·}} [[ImageShack]] {{·}} [[Imgur]] {{·}} [[Inkblazers]] {{·}} [[Instagr.am]] {{·}} [[Kepfeltoltes.hu]] {{·}} [[Kephost.com]] {{·}} [[Kephost.hu]] {{·}} [[Kepkezelo.com]] {{·}} [[Keptarad.hu]] {{·}} [[Madden GIFERATOR]] {{·}} [[MLKSHK]] {{·}} [[Microsoft Clip Art]] {{·}} [[Nokia Memories]] {{·}} [[noob.hu]] {{·}} [[Odysee]] {{·}} [[Panoramio]] {{·}} [[Photobucket]] {{·}} [[Picasa]] {{·}} [[Picplz]] {{·}} [[PSharing]] {{·}} [[Ptch]] {{·}} [[puu.sh]] {{·}} [[Rawporter]] {{·}} [[Relay.im]] {{·}} [[ScreenshotsDatabase.com]] {{·}} [[Snapjoy]] {{·}} [[Streetfiles]] {{·}} [[Tabblo]] {{·}} [[Trovebox]] {{·}} [[TwitPic]] {{·}} [[Wallbase]] {{·}} [[Wallhaven]] {{·}} [[Webshots]] {{·}} [[Wikimedia Commons]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Knowledge/[[Wikis]]''' || colspan=2 | [[arXiv]] {{·}} [[Citizendium]] {{·}} [[Clipboard.com]] {{·}} [[Deletionpedia]] {{·}} [[EditThis]] {{·}} [[Encyclopedia Dramatica]] {{·}} [[Etherpad]] {{·}} [[Everything2]] {{·}} [[infoAnarchy]] {{·}} [[GeoNames]] {{·}} [[GNUPedia]] {{·}} [[Google Books]] ([[Google Books Ngram]]) {{·}} [[Horror Movie Database]] {{·}} [[Insurgency Wiki]] {{·}} [[Knol]] {{·}} [[Library Genesis]] {{·}} [[Lost Media Wiki]] {{·}} [[Neoseeker.com]] {{·}} [[Notepad.cc]] {{·}} [[Nupedia]] {{·}} [[OpenCourseWare]] {{·}} [[OpenStreetMap]] {{·}} [[Orain]] {{·}} [[Pastebin]] {{·}} [[Patch.com]] {{·}} [[Project Gutenberg]] {{·}} [[Puella Magi]] {{·}} [[Referata]] {{·}} [[Resedagboken]] {{·}} [[SongMeanings]] {{·}} [[ShoutWiki]] {{·}} [[The Internet Movie Database]] {{·}} [[TropicalWikis]] {{·}} [[Uncyclopedia]] {{·}} [[Urban Dictionary]] {{·}} [[Webmonkey]] {{·}} [[Wikia]] {{·}} [[Wikidot]] {{·}} [[WikiHow]] {{·}} [[Wikkii]] {{·}} [[WikiLeaks]] {{·}} [[Wikipedia]] ([[Simple English Wikipedia]]) {{·}} [[Wikispaces]] {{·}} [[Wikispot]] {{·}} [[Wik.is]] {{·}} [[Wiki-Site]] {{·}} [[WikiTravel]] {{·}} [[Word Count Journal]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Magazines/Blogs/News''' || colspan=2 | [[Cyberpunkreview.com]] {{·}} [[Game Developer Magazine]] {{·}} [[Gigaom]] {{·}} [[Helium]] {{·}} [[JPG Magazine]] {{·}} [[Polygamia.pl]] {{·}} [[San Fransisco Bay Guardian]] {{·}} [[Scoop]] {{·}} [[Regretsy]] {{·}} [[Yahoo! Voices]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Microblogging]]''' || colspan=2 | [[Heello]] {{·}} [[Identi.ca]] {{·}} [[Jaiku]] {{·}} [[Mommo.hu]] {{·}} [[Plurk]] {{·}} [[Sina Weibo]] {{·}} [[Twitter]] {{·}} [[TwitLonger]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Music/Audio''' || colspan=2 | [[AOL Music]] {{·}} [[Audimated.com]] {{·}} [[Cinch]] {{·}} [[digCCmixter]] {{·}} [[Dogmazic.net]] {{·}} [[Earbits]] {{·}} [[exfm]] {{·}} [[Free Music Archive]] {{·}} [[Gogoyoko]] {{·}} [[Indaba Music]] {{·}} [[Instacast]] {{·}} [[Jamendo]] {{·}} [[Last.fm]] {{·}} [[Music Unlimited]] {{·}} [[MOG]] {{·}} [[PureVolume]] {{·}} [[Reverbnation]] {{·}} [[ShareTheMusic]] {{·}} [[SoundCloud]] {{·}} [[Soundpedia]] {{·}} [[This Is My Jam]] {{·}} [[TuneWiki]] {{·}} [[Twaud.io]] {{·}} [[WinAmp]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''People''' || colspan=2 | [[Aaron Swartz]] {{·}} [[Michael S. Hart]] {{·}} [[Steve Jobs]] {{·}} [[Mark Pilgrim]] {{·}} [[Dennis Ritchie]] {{·}} [[Len Sassaman Project]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Protocols/Infrastructure''' || colspan=2 | [[FTP]] {{·}} [[Gopher]] {{·}} [[IRC]] {{·}} [[Usenet]] {{·}} [[World Wide Web]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Q&A''' || colspan=2 | [[Askville]] {{·}} [[Answerbag]] {{·}} [[Answers.com]] {{·}} [[Ask.com]] {{·}} [[Askalo]] {{·}} [[Baidu Knows]] {{·}} [[Blurtit]] {{·}} [[ChaCha]] {{·}} [[Experts Exchange]] {{·}} [[Formspring]] {{·}} [[GirlsAskGuys]] {{·}} [[Google Answers]] {{·}} [[Google Baraza]] {{·}} [[JustAnswer]] {{·}} [[MetaFilter]] {{·}} [[Quora]] {{·}} [[Retrospring]] {{·}} [[StackExchange]] {{·}} [[The AnswerBank]] {{·}} [[The Internet Oracle]] {{·}} [[Uclue]] {{·}} [[WikiAnswers]] {{·}} [[Yahoo! Answers]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Recipes/Food''' || colspan=2 | [[Allrecipes]] {{·}} [[Epicurious]] {{·}} [[Food.com]] {{·}} [[Foodily]] {{·}} [[Food Network]] {{·}} [[Punchfork]] {{·}} [[ZipList]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Social bookmarking''' || colspan=2 | [[Addinto]] {{·}} [[Backflip]] {{·}} [[Balatarin]] {{·}} [[BibSonomy]] {{·}} [[Bkmrx]] {{·}} [[Blinklist]] {{·}} [[BlogMarks]] {{·}} [[BookmarkSync]] {{·}} [[CiteULike]] {{·}} [[Connotea]] {{·}} [[Delicious]] {{·}} [[Designer News]] {{·}} [[Digg]] {{·}} [[Diigo]] {{·}} [[Dir.eccion.es]] {{·}} [[Evernote]] {{·}} [[Excite Bookmark]] {{·}} [[Faves]] {{·}} [[Favilous]] {{·}} [[folkd]] {{·}} [[Freelish]] {{·}} [[Getboo]] {{·}} [[GiveALink.org]] {{·}} [[Gnolia]] {{·}} [[Google Bookmarks]] {{·}} [[Hacker News]] {{·}} [[HeyStaks]] {{·}} [[IndianPad]] {{·}} [[Kippt]] {{·}} [[Knowledge Plaza]] {{·}} [[Licorize]] {{·}} [[Linkwad]] {{·}} [[Menéame]] {{·}} [[Microsoft Developer Network]] {{·}} [[myVIP]] {{·}} [[Mister Wong]] {{·}} [[My Web]] {{·}} [[Mylink Vault]] {{·}} [[Newsvine]] {{·}} [[Oneview]] {{·}} [[Pearltrees]] {{·}} [[Pinboard]] {{·}} [[Pocket]] {{·}} [[Propeller.com]] {{·}} [[Reddit]] {{·}} [[sabros.us]] {{·}} [[Scloog]] {{·}} [[Scuttle]] {{·}} [[Simpy]] {{·}} [[SiteBar]] {{·}} [[Slashdot]] {{·}} [[Squidoo]] {{·}} [[StumbleUpon]] {{·}} [[Twine]] {{·}} [[Vizited]] {{·}} [[Yummymarks]] {{·}} [[Xmarks]] {{·}} [[Yahoo! Buzz]] {{·}} [[Zootool]] {{·}} [[Zotero]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Social network|Social networks]]''' || colspan=2 | [[Bebo]] {{·}} [[BlackPlanet]] {{·}} [[Classmates.com]] {{·}} [[Cyworld]] {{·}} [[Dogster]] {{·}} [[Dopplr]] {{·}} [[douban]] {{·}} [[Ello]] {{·}} [[Facebook]] {{·}} [[Flixster]] {{·}} [[FriendFeed]] {{·}} [[Friendster]] {{·}} [[Friends Reunited]] {{·}} [[Gaia Online]] {{·}} [[Google+]] {{·}} [[Habbo]] {{·}} [[hi5]] {{·}} [[Hyves]] {{·}} [[iWiW]] {{·}} [[LinkedIn]] {{·}} [[Miiverse]] {{·}} [[mixi]] {{·}} [[MyHeritage]] {{·}} [[MyLife]] {{·}} [[Myspace]] {{·}} [[myVIP]] {{·}} [[Netlog]] {{·}} [[Odnoklassniki]] {{·}} [[Orkut]] {{·}} [[Plaxo]] {{·}} [[Qzone]] {{·}} [[Renren]] {{·}} [[Skyrock]] {{·}} [[Sonico.com]] {{·}} [[Storylane]] {{·}} [[Tagged]] {{·}} [[tvtag]] {{·}} [[Upcoming]] {{·}} [[Viadeo]] {{·}} [[Vine]] {{·}} [[Vkontakte]] {{·}} [[WeeWorld]] {{·}} [[Weibo]] {{·}} [[Wretch]] {{·}} [[Yahoo! Groups]] {{·}} [[Yahoo! Stars India]] {{·}} [[Yahoo! Upcoming]] {{·}} [[Social network|more sites...]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Shopping/Retail''' || colspan=2 | [[Alibaba]] {{·}} [[AliExpress]] {{·}} [[Amazon]] {{·}} [[Apple Store]] {{·}} [[eBay]] {{·}} [[Printfection]] {{·}} [[RadioShack]] {{·}} [[Sears]] {{·}} [[Target]] {{·}} [[The Book Depository]] {{·}} [[ThinkGeek]] {{·}} [[Walmart]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Software/[[Code hosting services|code hosting]]''' || colspan=2 | [[Android Development]] {{·}} [[Alioth]] {{·}} [[Assembla]] {{·}} [[BerliOS]] {{·}} [[Betavine]] {{·}} [[Bitbucket]] {{·}} [[BountySource]] {{·}} [[Codecademy]] {{·}} [[CodePlex]] {{·}} [[Freepository]] {{·}} [[Free Software Foundation]] {{·}} [[GNU Savannah]] {{·}} [[GitHost]] {{·}} [[GitHub]] {{·}} [[GitHub Downloads]] {{·}} [[Gitorious]] {{·}} [[Gna!]] {{·}} [[Google Code]] {{·}} [[ibiblio]] {{·}} [[java.net]] {{·}} [[JavaForge]] {{·}} [[KnowledgeForge]] {{·}} [[Launchpad]] {{·}} [[LuaForge]] {{·}} [[Maemo]] {{·}} [[mozdev]] {{·}} [[OSOR.eu]] {{·}} [[OW2 Consortium]] {{·}} [[Openmoko]] {{·}} [[OpenSolaris]] {{·}} [[Ourproject.org]] {{·}} [[Ovi Store]] {{·}} [[Project Kenai]] {{·}} [[RubyForge]] {{·}} [[SEUL.org]] {{·}} [[SourceForge]] {{·}} [[Stypi]] {{·}} [[TestFlight]] {{·}} [[tigris.org]] {{·}} [[Transifex]] {{·}} [[TuxFamily]] {{·}} [[Yahoo! Downloads]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Torrenting/Piracy''' || colspan=2 | [[ExtraTorrent]] {{·}} [[EZTV]] {{·}} [[isoHunt]] {{·}} [[KickassTorrents]] {{·}} [[The Pirate Bay]] {{·}} [[Torrentz]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Video hosting]]''' || colspan=2 | [[Academic Earth]] {{·}} [[Blip.tv]] {{·}} [[Epic]] {{·}} [[Google Video]] {{·}} [[Justin.tv]] {{·}} [[Niconico]] {{·}} [[Nokia Trailers]] {{·}} [[Qwiki]] {{·}} [[Skillfeed]] {{·}} [[Stickam]] {{·}} [[TED Talks]] {{·}} [[Ticker.tv]] {{·}} [[Twitch.tv]] {{·}} [[Ustream]] {{·}} [[Videoplayer.hu]] {{·}} [[Viddler]] {{·}} [[Viddy]] {{·}} [[Vimeo]] {{·}} [[Vine]] {{·}} [[Vstreamers]] {{·}} [[Yahoo! Video]] {{·}} [[YouTube]] {{·}} [[Famous Internet videos]] ([[Me at the zoo]])<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[List of website hosts|Web hosting]]''' || [[Angelfire]] {{·}} [[Brace.io]] {{·}} [[BT Internet]] {{·}} [[CableAmerica Personal Web Space]] {{·}} [[Claranet Netherlands Personal Web Pages]] {{·}} [[Comcast Personal Web Pages]] {{·}} [[Extra.hu]] {{·}} [[FortuneCity]] {{·}} [[Free ProHosting]] {{·}} [[GeoCities]] ([[GeoCities Torrent Patch|patch]]) {{·}} [[Google Business Sitebuilder]] {{·}} [[Google Sites]] {{·}} [[Internet Centrum]] {{·}} [[MBinternet]] {{·}} [[MSN TV]] {{·}} [[Nwnyet]] {{·}} [[Parodius Networking]] {{·}} [[Prodigy.net]] {{·}} [[Saunalahti Iso G]] {{·}} [[Swipnet]] {{·}} [[Telenor]] {{·}} [[Tripod]] {{·}} [[University of Michigan personal webpages]] {{·}} [[Verizon Mysite]] {{·}} [[Verizon Personal Web Space]] {{·}} [[Webzdarma]] {{·}} [[Virgin Media]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Web applications''' || colspan=2 | [[Mailman]] {{·}} [[MediaWiki]] {{·}} [[phpBB]] {{·}} [[Simple Machines Forum]] {{·}} [[vBulletin]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Other''' || colspan=2 | [[800notes]] {{·}} [[AOL]] {{·}} [[Akoha]] {{·}} [[Ancestry.com]] {{·}} [[April Fools' Day]] {{·}} [[Amplicate]] {{·}} [[AutoAdmit]] {{·}} [[Bre.ad]] {{·}} [[Circavie]] {{·}} [[Cobook]] {{·}} [[Co.mments]] {{·}} [[Countdown]] {{·}} [[Distill]] {{·}} [[Dmoz]] {{·}} [[Easel]] {{·}} [[Eircode]] {{·}} [[Electronic Frontier Foundation]] {{·}} [[FanFiction.Net]] {{·}} [[Feedly]] {{·}} [[Ficlets]] {{·}} [[Forrst]] {{·}} [[FunnyExam.com]] {{·}} [[FurAffinity]] {{·}} [[Google Helpouts]] {{·}} [[Google Moderator]] {{·}} [[Google Reader]] {{·}} [[ICQmail]] {{·}} [[IFTTT]] {{·}} [[Jajah]] {{·}} [[JuniorNet]] {{·}} [[Lulu Poetry]] {{·}} [[Mobile Phone Applications]] {{·}} [[Mochi Media]] {{·}} [[Mozilla Firefox]] {{·}} [[MyBlogLog]] {{·}} [[NBII]] {{·}} [[Neopets]] {{·}} [[Quantcast]] {{·}} [[Quizilla]] {{·}} [[Salon Table Talk]] {{·}} [[Shutdownify]] {{·}} [[Slidecast]] {{·}} [[SOPA blackout pages]] {{·}} [[starwars.yahoo.com]] {{·}} [[TechNet]] {{·}} [[Toshiba Support]] {{·}} [[USA-Gov]] {{·}} [[Volán]] {{·}} [[Widgetbox]] {{·}} [[Windows Technical Preview]] {{·}} [[Wunderlist]] {{·}} [[Zoocasa]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Information''' || colspan=2 | [[A Million Ways to Die on the Web]] {{·}} [[Backup Tips]] {{·}} [[Cheap storage]] {{·}} [[Collecting items randomly]] {{·}} [[Data compression algorithms and tools]] {{·}} [[Dev]] {{·}} [[Discovery Data]] {{·}} [[DOS Floppies]] {{·}} [[Fortress of Solitude]] {{·}} [[Keywords]] {{·}} [[Naughty List]] {{·}} [[Nightmare Projects]] {{·}} [[Rescuing Floppy Disks|Rescuing floppy disks]] {{·}} [[Rescuing optical media]] {{·}} [[Site exploration]] {{·}} [[The WARC Ecosystem]] {{·}} [[Working with ARCHIVE.ORG]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Projects]]''' || colspan=2 | [[ArchiveCorps]] {{·}} [[Audit2014]] {{·}} [[Emularity]] {{·}} [[Faceoff]] {{·}} [[FlickrFckr]] {{·}} [[Froogle]] {{·}} [[ftp-gov]] {{·}} [[INTERNETARCHIVE.BAK]] ([[Internet Archive Census]]) {{·}} [[IRC Quotes]] {{·}} [[Javascript Mess|JSMESS]] {{·}} [[Jsvlc|JSVLC]] {{·}} [[Just Solve the Problem 2012|Just Solve the Problem]] {{·}} [[NewsGrabber]] {{·}} [[Project Newsletter]] {{·}} [[Valhalla]] {{·}} [[Web Roasting]] ([[ISP Hosting]] {{·}} [[University Web Hosting]]) {{·}} [[Woohoo]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Software|Tools]]''' || colspan=2 | [[ArchiveBot]] {{·}} [[ArchiveTeam Warrior]] ([[Tracker]]) {{·}} [[Google Takeout]] {{·}} [[HTTrack options|HTTrack]] {{·}} [[Video|Video downloaders]] {{·}} [[Wget]] ([[Wget with Lua hooks|Lua]] {{·}} [[Wget with WARC output|WARC]])<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Teams''' || colspan=2 | [[Bibliotheca Anonoma]] {{·}} [[LibreTeam]] {{·}} [[URLTeam]] {{·}} [[Yahoo Video Warroom]] {{·}} [[WikiTeam]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''About [[Archive Team]]''' || colspan=2 | [[Introduction]] {{·}} [[Philosophy]] {{·}} [[Who We Are]] {{·}} [[Robots.txt|Our stance on robots.txt]] {{·}} [[Why Back Up?]] {{·}} [[Software]] {{·}} [[Formats]] {{·}} [[Storage Media]] {{·}} [[Recommended Reading]] {{·}} [[Films and documentaries about archiving]] {{·}} [[Talks]] {{·}} [[In The Media]] {{·}} [[Frequently Asked Questions|FAQ]]<br />
|}<br />
</center>[[Category:Archive Team]]<noinclude>[[Category:Templates]]</noinclude></div>
Yipdw
https://wiki.archiveteam.org/index.php?title=Main_Page/In_The_Media&diff=27374
Main Page/In The Media
2017-01-16T15:47:15Z
<p>Yipdw: Undo revision 26842 by Megalanya0 (talk)</p>
<hr />
<div><!-- This should contain the top 10 most recent items from In The Media --><br />
;*[https://www.youtube.com/watch?v=NdZxI3nFVJs ''Digital Amnesia'']<br />
:Bregtje van der Haak, ''VPRO Backlight'', 2014-09-11<br />
:(Also available at the [https://archive.org/details/DigitalAmnesiaDocumentary Internet Archive])<br />
;*[http://webwereld.nl/e-commerce/80283-hyves-redders-redden-het-niet-of-nipt Hyves-redders redden het niet óf nipt]<br />
:Jasper Bakker, ''Webwereld'', 2013-11-29. (Rough translation of the title: ''Hyves rescuers won't make it, or only just barely''.)<br />
;*[http://arstechnica.com/tech-policy/2013/10/isohunt-shuts-down-a-day-early-to-avoid-becoming-part-of-online-archive/ ''isoHunt shuts down a day early to avoid becoming part of online archive'']<br />
:Cyrus Farivar, ''Ars Technica'', 2013-10-21<br />
;*[https://torrentfreak.com/archiveteam-works-hard-to-avert-isohunt-data-massacre-131020/ ''ArchiveTeam Works Hard to Avert isoHunt Data Massacre'']<br />
:Ernesto, ''TorrentFreak'', 2013-10-20<br />
;*[http://www.chip.pl/artykuly/trendy/2013/06/cyfrowi-archeolodzy ''Cyfrowi archeolodzy WWW'']<br />
:Hieronim Walicki, ''CHIP.pl'', 2013-06-18. (Rough translation of the title: ''Digital archaeologists of the Web''.)<br />
;*[http://blogs.loc.gov/digitalpreservation/2013/06/and-the-winner-is-announcing-the-2013-ndsa-innovation-award-winners/ And the Winner Is… Announcing the 2013 NDSA Innovation Award Winners]<br />
:Trevor Owens, ''The Signal'', Library of Congress, 2013-06-11<br />
;*[http://www.rtve.es/noticias/20130527/archive-team-superheroes-evitan-webs-caigan-olvido/673120.shtml ''Archive Team: los superhéroes que evitan que las webs caigan en el olvido''] <br />
:Alvaro Ibanez, rtve.es, 2013-05-27. (Rough translation: ''Archive Team: the superheroes who keep websites from falling into oblivion''.)<br />
;*[http://techcrunch.com/2013/04/22/want-to-help-archive-upcoming-org-before-yahoo-shuts-it-down-try-this/ ''Want To Help Archive Upcoming.org Before Yahoo Shuts It Down? Try This.'']<br />
:Sarah Perez, ''TechCrunch'', 2013-04-22<br />
;*[http://www.huffingtonpost.com/2013/03/27/jason-scott-archive-team_n_2965368.html ''Jason Scott's Archive Team Is Saving The Web From Itself (And Rescuing Your Stuff)'']<br />
:Bianca Bosker, ''The Huffington Post'', 2013-03-27<br />
;*[http://www.dailydot.com/news/archive-team-preserving-posterous/ Archive Team races to preserve Posterous before it goes dark]<br />
:Kris Holt, ''The Daily Dot'', 2013-03-13<br />
<br />
;[[In The Media|More...]]</div>
Yipdw
https://wiki.archiveteam.org/index.php?title=Main_Page/In_The_Media&diff=27372
Main Page/In The Media
2017-01-16T15:47:03Z
<p>Yipdw: Undo revision 27135 by Megalanya0 (talk)</p>
<hr />
<div>'''MOTHERFUCKER ! ! !'''<br />
<br />
'''MOTHERFUCKER ! ! !'''<br />
<br />
'''MOTHERFUCKER ! ! !'''</div>
Yipdw
https://wiki.archiveteam.org/index.php?title=Vine&diff=26546
Vine
2016-11-15T10:01:38Z
<p>Yipdw: Add Vine video/user database</p>
<hr />
<div>{{Infobox project<br />
| title = Vine<br />
| logo = Vine Logo.png<br />
| image = Vine desktop.png<br />
| description = Vine's desktop home page in late 2016<br />
| URL = https://vine.co/<br />
| project_status = Application: {{closing}}, Videos: {{endangered}}<br />
| archiving_status = {{Upcoming}}<br />
| irc = vinewhine<br />
}}<br />
<br />
'''Vine''' is a short-form video sharing service where users can share six-second-long looping video clips. The service was founded in June 2012, and American microblogging website Twitter acquired it in October 2012, just before its official launch. Users' videos are published through Vine's social network and can be shared on other services such as Facebook and Twitter.[https://en.wikipedia.org/wiki/Vine_%28service%29]<br />
<br />
On October 27, 2016, the Vine team and Twitter announced that the mobile app would be discontinued in the coming months, and that the videos and content currently on the site would remain up and available for the time being.[https://medium.com/@vine/important-news-about-vine-909c5f4ae7a7#.eq93x8qvv]<br />
<br />
Because this announcement is vague, it's best to take it with a grain of salt and to start grabbing and saving videos from Vine now. A form for submitting specific vines to include has been set up [https://docs.google.com/forms/d/1otnjcaABgkxeao9GZKuGcjQlI_SUfHQMNRAyaKw-lZI/viewform?edit_requested=true here].<br />
<br />
We are also working on a bot that archives vines mentioned to us in tweets:<br />
<br />
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">GOOD NEWS! Through the magic of programming, you can now link your favorite Vines in a tweet mentioning @archiveteam and we&#39;ll pick &#39;em up.</p>&mdash; Archive Team (@archiveteam) [https://twitter.com/archiveteam/status/792162776051490816 October 29, 2016]</blockquote><br />
<br />
You can see all the videos and users we currently know about here: http://lothlorien.peach-bun.com:15984/_utils/database.html?vine/_design/videos/_view/work_items_by_created_at<br />
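<br />
That link opens CouchDB's web interface; the underlying view can also be queried directly over HTTP. A minimal sketch (assuming the view is served at the standard CouchDB path implied by the URL above; <code>limit</code> and <code>descending</code> are standard view parameters):<br />
<br />
 curl 'http://lothlorien.peach-bun.com:15984/vine/_design/videos/_view/work_items_by_created_at?limit=10&descending=true'<br />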
<br />
* [https://mic.com/articles/157977/inside-the-secret-meeting-that-changed-the-fate-of-vine-forever Inside the Secret Meeting that Changed the Fate of Vine Forever], Tech.Mic, October 29, 2016<br />
* [http://www.npr.org/sections/codeswitch/2016/10/28/499681576/vine-ending-grew-black-brown-talent A Moment Of Silence For The Black And Brown Talent That Grew On Vine], NPR, October 28, 2016<br />
* [https://medium.com/@vine/important-news-about-vine-909c5f4ae7a7#.zg1nunjye Important News about Vine - Today, we are sharing the news that in the coming months we’ll be discontinuing the mobile app.], Medium - October 28, 2016<br />
{{navigation box}}</div>
Yipdw
https://wiki.archiveteam.org/index.php?title=User:Yipdw&diff=25864
User:Yipdw
2016-06-27T05:54:36Z
<p>Yipdw: Remove ArchiveBot-related contact details</p>
<hr />
<div></div>
Yipdw
https://wiki.archiveteam.org/index.php?title=Reddit&diff=23833
Reddit
2015-07-03T15:28:09Z
<p>Yipdw: Undo revision 23830 by Yipdw (talk)</p>
<hr />
<div>{{Infobox project<br />
| title = reddit<br />
| image = Reddit home page 2013-03-26.png<br />
| description = reddit home page as seen on March 26, 2013<br />
| URL = http://www.reddit.com/<br />
| project_status = {{endangered}}<br />
| archiving_status = {{inprogress}} (through [[ArchiveBot]])<br />
| irc = deaddit<br />
}}<br />
<br />
'''reddit''' is a content aggregator and social bookmarking service similar to Digg. Users can submit links and text posts, and vote and comment on submissions in communities called "subreddits". It received considerable attention for its twelve-hour SOPA blackout in early January 2012.<br />
<br />
== Vital signs ==<br />
<br />
<s>Appears stable, though a small to medium size team is a concern.<br />
<br />
'''Update (6/10/15)''': the admins carried out bannings of several subreddits claiming they were harassing people, the most notable of which was /r/fatpeoplehate. This has instilled some fear, uncertainty, and doubt in some part of the userbase, with a few claiming that reddit will soon become what Digg is now: nearly dead.</s><br />
<br />
'''Extremely endangered - many subreddits are picketing after the firing of a reddit employee named Victoria by turning themselves private or restricting submissions.'''<br />
<br />
== Data liberation ==<br />
<br />
Currently (as of March 26, 2013), users can only see up to 1,000 posts and comments on a profile page. However, admin "spladug" stated [http://www.reddit.com/r/ideasfortheadmins/comments/10tai6/ever_wondered_the_data_liberation_policy_of_reddit/c6gicdf that older comments and posts are still in the database]. spladug also stated that the team is in favor of letting users retrieve dumps of their data, but that the task would be taxing on the servers. Since this comment was posted, there appears to have been no progress on a dump system. Because of this limitation, archiving the old-fashioned way (without wget) would be nearly impossible if things do wind up FUBAR in the future.<br />
<br />
No further progress appears to have been made since then as of June 2015.<br />
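<br />
As an illustration of the 1,000-item ceiling mentioned above, here is a minimal shell sketch that pages through a profile using reddit's public <code>.json</code> listings (it assumes the <code>jq</code> tool; the account name is a placeholder). The loop ends when the listing stops returning an "after" token, which happens at roughly 1,000 items:<br />
<br />
 USER=someuser   # placeholder account name<br />
 AFTER=""<br />
 while true; do<br />
   PAGE=$(curl -s -A "archiveteam-wiki-example" \<br />
     "https://www.reddit.com/user/$USER/submitted.json?limit=100&after=$AFTER")<br />
   echo "$PAGE" >> "$USER.jsonl"<br />
   AFTER=$(echo "$PAGE" | jq -r '.data.after')<br />
   [ "$AFTER" = "null" ] && break<br />
   sleep 2   # be polite to the servers<br />
 done<br />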
<br />
== External Links ==<br />
<br />
* {{url|1=http://www.reddit.com|2=reddit}}<br />
<br />
{{Navigation box}}</div>
Yipdw
https://wiki.archiveteam.org/index.php?title=Reddit&diff=23832
Reddit
2015-07-03T15:27:57Z
<p>Yipdw: Undo revision 23831 by Yipdw (talk)</p>
<hr />
<div>{{Infobox project<br />
| title = reddit<br />
| image = Reddit home page 2013-03-26.png<br />
| description = reddit home page as seen on March 26, 2013<br />
| URL = http://www.reddit.com/<br />
| project_status = {{endangered}}<br />
| archiving_status = {{Needs archiving}}<br />
| irc = deaddit<br />
}}<br />
<br />
'''reddit''' is a content aggregator and social bookmarking service similar to Digg. Users can submit links and text posts, and vote and comment on submissions in communities called "subreddits". It received considerable attention for its twelve-hour SOPA blackout in early January 2012.<br />
<br />
== Vital signs ==<br />
<br />
<s>Appears stable, though a small to medium size team is a concern.<br />
<br />
'''Update (6/10/15)''': the admins carried out bannings of several subreddits claiming they were harassing people, the most notable of which was /r/fatpeoplehate. This has instilled some fear, uncertainty, and doubt in some part of the userbase, with a few claiming that reddit will soon become what Digg is now: nearly dead.</s><br />
<br />
'''Extremely endangered - many subreddits are picketing after the firing of a reddit employee named Victoria by turning themselves private or restricting submissions.'''<br />
<br />
== Data liberation ==<br />
<br />
Currently (as of March 26, 2013), users can only see up to 1,000 posts and comments on a profile page. However, admin "spladug" stated [http://www.reddit.com/r/ideasfortheadmins/comments/10tai6/ever_wondered_the_data_liberation_policy_of_reddit/c6gicdf that older comments and posts are still in the database]. spladug also stated that the team is in favor of letting users retrieve dumps of their data, but that the task would be taxing on the servers. Since this comment was posted, there appears to have been no progress on a dump system. Because of this limitation, archiving the old-fashioned way (without wget) would be nearly impossible if things do wind up FUBAR in the future.<br />
<br />
No further progress appears to have been made since then as of June 2015.<br />
<br />
== External Links ==<br />
<br />
* {{url|1=http://www.reddit.com|2=reddit}}<br />
<br />
{{Navigation box}}</div>
Yipdw
https://wiki.archiveteam.org/index.php?title=Reddit&diff=23831
Reddit
2015-07-03T15:24:13Z
<p>Yipdw: </p>
<hr />
<div>{{Infobox project<br />
| title = reddit<br />
| image = Reddit home page 2013-03-26.png<br />
| description = reddit home page as seen on March 26, 2013<br />
| URL = http://www.reddit.com/<br />
| project_status = {{endangered}}<br />
| archiving_status = {{upcoming}}<br />
| irc = deaddit<br />
}}<br />
<br />
'''reddit''' is a content aggregator and social bookmarking service similar to Digg. Users can submit links and text posts, and vote and comment on submissions in communities called "subreddits". It received considerable attention for its twelve-hour SOPA blackout in early January 2012.<br />
<br />
== Vital signs ==<br />
<br />
<s>Appears stable, though a small to medium size team is a concern.<br />
<br />
'''Update (6/10/15)''': the admins carried out bannings of several subreddits claiming they were harassing people, the most notable of which was /r/fatpeoplehate. This has instilled some fear, uncertainty, and doubt in some part of the userbase, with a few claiming that reddit will soon become what Digg is now: nearly dead.</s><br />
<br />
'''Extremely endangered - many subreddits are picketing after the firing of a reddit employee named Victoria by turning themselves private or restricting submissions.'''<br />
<br />
== Data liberation ==<br />
<br />
Currently (as of March 26, 2013), users can only see up to 1,000 posts and comments on a profile page. However, admin "spladug" stated [http://www.reddit.com/r/ideasfortheadmins/comments/10tai6/ever_wondered_the_data_liberation_policy_of_reddit/c6gicdf that older comments and posts are still in the database]. spladug also stated that the team is in favor of letting users retrieve dumps of their data, but that the task would be taxing on the servers. Since this comment was posted, there appears to have been no progress on a dump system. Because of this limitation, archiving the old-fashioned way (without wget) would be nearly impossible if things do wind up FUBAR in the future.<br />
<br />
No further progress appears to have been made since then as of June 2015.<br />
<br />
== External Links ==<br />
<br />
* {{url|1=http://www.reddit.com|2=reddit}}<br />
<br />
{{Navigation box}}</div>
Yipdw
https://wiki.archiveteam.org/index.php?title=Reddit&diff=23830
Reddit
2015-07-03T15:23:39Z
<p>Yipdw: If you're going to actually download reddit you probably want something bigger than ArchiveBot</p>
<hr />
<div>{{Infobox project<br />
| title = reddit<br />
| image = Reddit home page 2013-03-26.png<br />
| description = reddit home page as seen on March 26, 2013<br />
| URL = http://www.reddit.com/<br />
| project_status = {{endangered}}<br />
| archiving_status = {{Needs archiving}}<br />
| irc = deaddit<br />
}}<br />
<br />
'''reddit''' is a content aggregator and social bookmarking service similar to Digg. Users can submit links and text posts, and vote and comment on submissions in communities called "subreddits". It received considerable attention for its twelve-hour SOPA blackout in early January 2012.<br />
<br />
== Vital signs ==<br />
<br />
<s>Appears stable, though a small to medium size team is a concern.<br />
<br />
'''Update (6/10/15)''': the admins carried out bannings of several subreddits claiming they were harassing people, the most notable of which was /r/fatpeoplehate. This has instilled some fear, uncertainty, and doubt in some part of the userbase, with a few claiming that reddit will soon become what Digg is now: nearly dead.</s><br />
<br />
'''Extremely endangered - many subreddits are picketing after the firing of a reddit employee named Victoria by turning themselves private or restricting submissions.'''<br />
<br />
== Data liberation ==<br />
<br />
Currently (as of March 26, 2013), users can only see up to 1,000 posts and comments on a profile page. However, admin "spladug" stated [http://www.reddit.com/r/ideasfortheadmins/comments/10tai6/ever_wondered_the_data_liberation_policy_of_reddit/c6gicdf that older comments and posts are still in the database]. spladug also stated that the team is in favor of letting users retrieve dumps of their data, but that the task would be taxing on the servers. Since this comment was posted, there appears to have been no progress on a dump system. Because of this limitation, archiving the old-fashioned way (without wget) would be nearly impossible if things do wind up FUBAR in the future.<br />
<br />
No further progress appears to have been made since then as of June 2015.<br />
<br />
== External Links ==<br />
<br />
* {{url|1=http://www.reddit.com|2=reddit}}<br />
<br />
{{Navigation box}}</div>
Yipdw
https://wiki.archiveteam.org/index.php?title=Dev/Tracker&diff=22890
Dev/Tracker
2015-04-19T07:25:46Z
<p>Yipdw: Document hiredis' expectation to find node.js as node; simplify npm install</p>
<hr />
<div>This article describes how to set up your own '''[[tracker]]''' just like the official Archive Team tracker. Use this guide only if you want to do a full test of the infrastructure.<br />
<br />
'''Note:''' A virtual machine appliance is available at [https://github.com/ArchiveTeam/archiveteam-dev-env ArchiveTeam/archiveteam-dev-env] which contains a ready-to-use tracker.<br />
<br />
Installation will cover:<br />
<br />
* Environment: Ubuntu/Debian<br />
* Languages:<br />
** Python<br />
** Ruby<br />
** JavaScript<br />
* Web: <br />
** Nginx<br />
** Phusion Passenger<br />
** Redis<br />
** Node.js<br />
* Tools:<br />
** Screen<br />
** Rsync<br />
** Git<br />
** Wget<br />
** regular expressions<br />
<br />
== The Tracker ==<br />
<br />
The Tracker manages what items are claimed by users that run the Seesaw client. It also shows a pretty leaderboard.<br />
<br />
Let's create a dedicated account to run the web server and tracker:<br />
<br />
sudo adduser --system --group --shell /bin/bash tracker<br />
<br />
=== Redis ===<br />
<br />
Redis is a database stored in memory, so item names should be engineered to be memory-efficient. Redis saves its database periodically into a file located at /var/lib/redis/6379/dump.rdb. It is safe to copy the file, e.g., for backups.<br />
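<br />
Because dump.rdb is safe to copy, a backup can be as simple as triggering a background save and copying the file once it finishes. A minimal sketch (the backup destination is a placeholder):<br />
<br />
 LAST=$(redis-cli LASTSAVE)<br />
 redis-cli BGSAVE<br />
 while [ "$(redis-cli LASTSAVE)" = "$LAST" ]; do sleep 1; done   # wait for the save to complete<br />
 cp /var/lib/redis/6379/dump.rdb /backups/dump-$(date +%Y%m%d).rdb<br />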
<br />
To install Redis, you may follow these [http://redis.io/topics/quickstart quickstart instructions], but we'll show you how.<br />
<br />
These steps are from the quickstart guide:<br />
<br />
wget http://download.redis.io/redis-stable.tar.gz<br />
tar xvzf redis-stable.tar.gz<br />
cd redis-stable<br />
make<br />
<br />
Now install the server:<br />
<br />
sudo make install<br />
cd utils<br />
sudo ./install_server.sh<br />
<br />
Note, by default, it runs as root. Let's stop it and make it run under www-data:<br />
<br />
sudo invoke-rc.d redis_6379 stop<br />
sudo adduser --system --group www-data<br />
sudo chown -R www-data:www-data /var/lib/redis/6379/<br />
sudo chown -R www-data:www-data /var/log/redis_6379.log<br />
<br />
Edit the config file <code>/etc/redis/6379.conf</code> with the options like:<br />
<br />
bind 127.0.0.1<br />
pidfile /var/run/shm/redis_6379.pid<br />
<br />
Now tell the start up script to run it as www-data:<br />
<br />
sudo nano /etc/init.d/redis_6379<br />
<br />
Change the EXEC and CLIEXEC variables to use <code>sudo -u www-data -g www-data</code>:<br />
<br />
EXEC="sudo -u www-data -g www-data /usr/local/bin/redis-server"<br />
CLIEXEC="sudo -u www-data -g www-data /usr/local/bin/redis-cli"<br />
PIDFILE=/var/run/shm/redis_6379.pid<br />
<br />
To keep background saves from failing on <code>fork()</code> (Redis needs lots of memory), run:<br />
<br />
sudo sysctl vm.overcommit_memory=1<br />
<br />
The above setting will be lost after reboot. Add this line to <code>/etc/sysctl.conf</code>:<br />
<br />
vm.overcommit_memory=1<br />
<br />
The log file will get big so we need a logrotate config. Create one at <code>/etc/logrotate.d/redis</code> with the config:<br />
<br />
/var/log/redis_*.log {<br />
daily<br />
rotate 10<br />
copytruncate<br />
delaycompress<br />
compress<br />
notifempty<br />
missingok<br />
size 10M<br />
}<br />
<br />
Start up Redis again using:<br />
<br />
sudo invoke-rc.d redis_6379 start<br />
<br />
=== Nginx with Passenger ===<br />
<br />
Nginx is a web server. Phusion Passenger is a module within Nginx that runs Rails applications.<br />
<br />
There is a [https://www.digitalocean.com/community/articles/how-to-install-rails-and-nginx-with-passenger-on-ubuntu guide] on how to install Nginx with Passenger, the following instructions are similar.<br />
<br />
Log in as tracker:<br />
<br />
sudo -u tracker -i<br />
<br />
We'll use RVM to install Ruby libraries:<br />
<br />
curl -L get.rvm.io | bash -s stable<br />
source ~/.rvm/scripts/rvm<br />
rvm requirements<br />
<br />
A list of packages that need to be installed will be shown. Log out of the tracker account, install them, and log back into the tracker account.<br />
<br />
Install Ruby and Bundler:<br />
<br />
rvm install 2.2.2<br />
rvm rubygems current<br />
gem install bundler<br />
<br />
Install Passenger:<br />
<br />
gem install passenger<br />
<br />
Install Nginx. This command will download, compile, and install a basic Nginx server:<br />
<br />
passenger-install-nginx-module<br />
<br />
Use the following prefix for Nginx installation:<br />
<br />
/home/tracker/nginx/<br />
<br />
Point Nginx at the tracker software (to be installed later). Edit <code>nginx/conf/nginx.conf</code> and use these lines under the <code>location /</code> option:<br />
<br />
root /home/tracker/universal-tracker/public;<br />
passenger_enabled on;<br />
client_max_body_size 15M;<br />
<br />
The logs will get big so we'll use logrotate. Save this into <code>/home/tracker/logrotate.conf</code>:<br />
<br />
/home/tracker/nginx/logs/error.log<br />
/home/tracker/nginx/logs/access.log {<br />
daily<br />
rotate 10<br />
copytruncate<br />
delaycompress<br />
compress<br />
notifempty<br />
missingok<br />
size 10M<br />
}<br />
<br />
To call logrotate, we'll add an entry using crontab:<br />
<br />
crontab -e<br />
<br />
Now add the following line:<br />
<br />
@daily /usr/sbin/logrotate --state /home/tracker/.logrotate.state /home/tracker/logrotate.conf<br />
<br />
Log out of the tracker account at this point.<br />
<br />
Let's create an Upstart configuration file to start up Nginx. Save this into <code>/etc/init/nginx-tracker.conf</code>:<br />
<br />
description "nginx http daemon"<br />
<br />
start on runlevel [2]<br />
stop on runlevel [016]<br />
<br />
setuid tracker<br />
setgid tracker<br />
<br />
console output<br />
<br />
exec /home/tracker/nginx/sbin/nginx -c /home/tracker/nginx/conf/nginx.conf -g "daemon off;"<br />
<br />
=== Tracker ===<br />
<br />
Log in to the tracker account.<br />
<br />
Download the Tracker software:<br />
<br />
git clone https://github.com/ArchiveTeam/universal-tracker.git<br />
<br />
We'll need to configure the location of Redis. Copy the config file:<br />
<br />
cp universal-tracker/config/redis.json.example universal-tracker/config/redis.json<br />
<br />
Add a "production" object into the JSON file. Here is an example:<br />
<br />
{<br />
"development": {<br />
"host": "127.0.0.1",<br />
"port": 6379,<br />
"db": 13<br />
},<br />
"test": {<br />
"host": "127.0.0.1",<br />
"port": 6379,<br />
"db": 14<br />
},<br />
"production": {<br />
"host":"127.0.0.1",<br />
"port":6379,<br />
"db": 1<br />
}<br />
}<br />
<br />
* Now we may need to fix an issue with Passenger forking after the Redis connection has been made. Please see https://github.com/ArchiveTeam/universal-tracker/issues/5 for more information.<br />
* There is also an issue with non-ASCII names. See https://github.com/ArchiveTeam/universal-tracker/issues/7.<br />
<br />
Now install the necessary gems:<br />
<br />
cd universal-tracker<br />
bundle install<br />
<br />
Log out of the tracker account at this point.<br />
<br />
=== Node.js ===<br />
<br />
Node.js is required to run the fancy leaderboard using WebSockets. We'll use NPM to manage the Node.js libraries:<br />
<br />
sudo apt-get install npm<br />
<br />
Log into the tracker account.<br />
<br />
Now we manually edit the Node.js program, because it expects configuration that we have to fill in by hand:<br />
<br />
cp -R universal-tracker/broadcaster .<br />
nano broadcaster/server.js<br />
<br />
Modify <code>env</code> and <code>trackerConfig</code> variables to something like this:<br />
<br />
var env = {<br />
tracker_config: {<br />
redis_pubsub_channel: "tracker-log"<br />
},<br />
redis_db: 1<br />
};<br />
var trackerConfig = env['tracker_config'];<br />
<br />
You also need to modify the "transports" configuration by adding <code>websocket</code>. The new line should look like this:<br />
<br />
io.set("transports", ["websocket", "xhr-polling"]);<br />
<br />
Install the Node.js libraries needed:<br />
<br />
npm install<br />
<br />
If you get an error while installing hiredis, you may need to provide Debian's "nodejs" as "node". Symlink "node" to the nodejs executable and try again.<br />
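<br />
For example, on Debian-based systems something like this usually works (the target path may differ on your machine):<br />
<br />
 sudo ln -s "$(which nodejs)" /usr/local/bin/node<br />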
<br />
Log out of the tracker account at this point.<br />
<br />
Create an Upstart file at <code>/etc/init/nodejs-tracker.conf</code>:<br />
<br />
description "tracker nodejs daemon"<br />
<br />
start on runlevel [2]<br />
stop on runlevel [016]<br />
<br />
setuid tracker<br />
setgid tracker<br />
<br />
exec node /home/tracker/broadcaster/server.js<br />
<br />
=== Tracker Setup ===<br />
<br />
Start up the Tracker and Broadcaster:<br />
<br />
sudo start nginx-tracker<br />
sudo start nodejs-tracker<br />
<br />
You now need to configure the tracker. Open up your web browser and visit http://localhost/global-admin/.<br />
<br />
* In Global-Admin→Configuration→Live logging host, specify the public location of the Node.js app. By default, it uses port 8080.<br />
<br />
You are now free to manage the tracker.<br />
<br />
Notes:<br />
<br />
* If you followed this guide, the rsync location is defined as <code>rsync://HOSTNAME/PROJECT_NAME/:downloader/</code><br />
* The '''''trailing slash''''' within the rsync URL is very important. Without it, files will not be uploaded within the directory.<br />
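<br />
For example, an upload of a finished item would look like this (host, project, and file names are placeholders; note the trailing slash):<br />
<br />
 rsync -avz myitem.warc.gz rsync://HOSTNAME/PROJECT_NAME/:downloader/<br />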
<br />
==== Claims ====<br />
<br />
You probably want Cron to clear out old claims. The Tracker includes a Ruby script that will do that for you. By default, it removes claims older than 6 hours. You may want to change that for big items by creating a copy of the script for each project.<br />
<br />
To set up Cron, log in as the tracker account and run:<br />
<br />
which ruby<br />
<br />
Take note of which Ruby executable is used.<br />
<br />
Now edit the Cron table:<br />
<br />
crontab -e<br />
<br />
Add the following line which runs <code>release-stale.rb</code> every 6 hours:<br />
<br />
0 */6 * * * cd /home/tracker/universal-tracker && WHICH_RUBY scripts/release-stale.rb PROJECT_NAME<br />
<br />
==== Logs ====<br />
<br />
Since the Tracker stores logs in Redis, it will use up memory quickly. <code>log-drainer.rb</code> continuously writes the logs into a text file:<br />
<br />
mkdir -p /home/tracker/universal-tracker/logs/<br />
cd /home/tracker/universal-tracker && ruby scripts/log-drainer.rb<br />
<br />
Pressing CTRL+C will stop it. Run this within a Screen session.<br />
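<br />
For example, to run it detached in a Screen session (the session name is arbitrary):<br />
<br />
 screen -dmS log-drainer bash -c 'cd /home/tracker/universal-tracker && ruby scripts/log-drainer.rb'<br />
<br />
Reattach with <code>screen -r log-drainer</code> to check on it or stop it.<br />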
<br />
This crontab entry will compress the log files that haven't been modified in two days:<br />
<br />
@daily find /home/tracker/universal-tracker/logs/ -iname "*.log" -mtime +2 -exec xz {} \;<br />
<br />
==== Reducing memory usage ====<br />
<br />
The Passenger Ruby module may use up too much memory. You can add the following lines to your nginx config. Add these inside the <code>http</code> block:<br />
<br />
passenger_max_pool_size 2;<br />
passenger_max_requests 10000;<br />
<br />
The first line allows spawning a maximum of 2 processes. The second line restarts Passenger after 10,000 requests to reclaim memory lost to leaks.<br />
<br />
{{devnav}}</div>
Yipdw
https://wiki.archiveteam.org/index.php?title=Dev/Tracker&diff=22889
Dev/Tracker
2015-04-18T22:36:14Z
<p>Yipdw: Use Ruby 2.2.2; it's latest stable</p>
<hr />
<div>This article describes how to set up your own '''[[tracker]]''' just like the official Archive Team tracker. Use this guide only if you want to do a full test of the infrastructure.<br />
<br />
'''Note:''' A virtual machine appliance is available at [https://github.com/ArchiveTeam/archiveteam-dev-env ArchiveTeam/archiveteam-dev-env] which contains a ready-to-use tracker.<br />
<br />
Installation will cover:<br />
<br />
* Environment: Ubuntu/Debian<br />
* Languages:<br />
** Python<br />
** Ruby<br />
** JavaScript<br />
* Web: <br />
** Nginx<br />
** Phusion Passenger<br />
** Redis<br />
** Node.js<br />
* Tools:<br />
** Screen<br />
** Rsync<br />
** Git<br />
** Wget<br />
** regular expressions<br />
<br />
== The Tracker ==<br />
<br />
The Tracker manages what items are claimed by users that run the Seesaw client. It also shows a pretty leaderboard.<br />
<br />
Let's create a dedicated account to run the web server and tracker:<br />
<br />
sudo adduser --system --group --shell /bin/bash tracker<br />
<br />
=== Redis ===<br />
<br />
Redis is a database stored in memory, so item names should be engineered to be memory-efficient. Redis saves its database periodically into a file located at /var/lib/redis/6379/dump.rdb. It is safe to copy the file, e.g., for backups.<br />
<br />
To install Redis, you may follow these [http://redis.io/topics/quickstart quickstart instructions], but we'll show you how.<br />
<br />
These steps are from the quickstart guide:<br />
<br />
wget http://download.redis.io/redis-stable.tar.gz<br />
tar xvzf redis-stable.tar.gz<br />
cd redis-stable<br />
make<br />
<br />
Now install the server:<br />
<br />
sudo make install<br />
cd utils<br />
sudo ./install_server.sh<br />
<br />
Note, by default, it runs as root. Let's stop it and make it run under www-data:<br />
<br />
sudo invoke-rc.d redis_6379 stop<br />
sudo adduser --system --group www-data<br />
sudo chown -R www-data:www-data /var/lib/redis/6379/<br />
sudo chown -R www-data:www-data /var/log/redis_6379.log<br />
<br />
Edit the config file <code>/etc/redis/6379.conf</code> with the options like:<br />
<br />
bind 127.0.0.1<br />
pidfile /var/run/shm/redis_6379.pid<br />
<br />
Now tell the start up script to run it as www-data:<br />
<br />
sudo nano /etc/init.d/redis_6379<br />
<br />
Change the EXEC and CLIEXEC variables to use <code>sudo -u www-data -g www-data</code>:<br />
<br />
EXEC="sudo -u www-data -g www-data /usr/local/bin/redis-server"<br />
CLIEXEC="sudo -u www-data -g www-data /usr/local/bin/redis-cli"<br />
PIDFILE=/var/run/shm/redis_6379.pid<br />
<br />
To keep background saves from failing on <code>fork()</code> (Redis needs lots of memory), run:<br />
<br />
sudo sysctl vm.overcommit_memory=1<br />
<br />
The above setting will be lost after reboot. Add this line to <code>/etc/sysctl.conf</code>:<br />
<br />
vm.overcommit_memory=1<br />
<br />
The log file will get big so we need a logrotate config. Create one at <code>/etc/logrotate.d/redis</code> with the config:<br />
<br />
/var/log/redis_*.log {<br />
daily<br />
rotate 10<br />
copytruncate<br />
delaycompress<br />
compress<br />
notifempty<br />
missingok<br />
size 10M<br />
}<br />
<br />
Start up Redis again using:<br />
<br />
sudo invoke-rc.d redis_6379 start<br />
<br />
=== Nginx with Passenger ===<br />
<br />
Nginx is a web server. Phusion Passenger is a module within Nginx that runs Rails applications.<br />
<br />
There is a [https://www.digitalocean.com/community/articles/how-to-install-rails-and-nginx-with-passenger-on-ubuntu guide] on how to install Nginx with Passenger, the following instructions are similar.<br />
<br />
Log in as tracker:<br />
<br />
sudo -u tracker -i<br />
<br />
We'll use RVM to install Ruby libraries:<br />
<br />
curl -L get.rvm.io | bash -s stable<br />
source ~/.rvm/scripts/rvm<br />
rvm requirements<br />
<br />
A list of packages that need to be installed will be shown. Log out of the tracker account, install them, and log back into the tracker account.<br />
<br />
Install Ruby and Bundler:<br />
<br />
rvm install 2.2.2<br />
rvm rubygems current<br />
gem install bundler<br />
<br />
Install Passenger:<br />
<br />
gem install passenger<br />
<br />
Install Nginx. This command will download, compile, and install a basic Nginx server:<br />
<br />
passenger-install-nginx-module<br />
<br />
Use the following prefix for Nginx installation:<br />
<br />
/home/tracker/nginx/<br />
<br />
Point Nginx at the tracker software (to be installed later). Edit <code>nginx/conf/nginx.conf</code> and use these lines under the <code>location /</code> option:<br />
<br />
root /home/tracker/universal-tracker/public;<br />
passenger_enabled on;<br />
client_max_body_size 15M;<br />
<br />
The logs will get big so we'll use logrotate. Save this into <code>/home/tracker/logrotate.conf</code>:<br />
<br />
/home/tracker/nginx/logs/error.log<br />
/home/tracker/nginx/logs/access.log {<br />
daily<br />
rotate 10<br />
copytruncate<br />
delaycompress<br />
compress<br />
notifempty<br />
missingok<br />
size 10M<br />
}<br />
<br />
To call logrotate, we'll add an entry using crontab:<br />
<br />
crontab -e<br />
<br />
Now add the following line:<br />
<br />
@daily /usr/sbin/logrotate --state /home/tracker/.logrotate.state /home/tracker/logrotate.conf<br />
<br />
Log out of the tracker account at this point.<br />
<br />
Let's create an Upstart configuration file to start up Nginx. Save this into <code>/etc/init/nginx-tracker.conf</code>:<br />
<br />
description "nginx http daemon"<br />
<br />
start on runlevel [2]<br />
stop on runlevel [016]<br />
<br />
setuid tracker<br />
setgid tracker<br />
<br />
console output<br />
<br />
exec /home/tracker/nginx/sbin/nginx -c /home/tracker/nginx/conf/nginx.conf -g "daemon off;"<br />
<br />
=== Tracker ===<br />
<br />
Log in to the tracker account.<br />
<br />
Download the Tracker software:<br />
<br />
git clone https://github.com/ArchiveTeam/universal-tracker.git<br />
<br />
We'll need to configure the location of Redis. Copy the config file:<br />
<br />
cp universal-tracker/config/redis.json.example universal-tracker/config/redis.json<br />
<br />
Add a "production" object into the JSON file. Here is an example:<br />
<br />
{<br />
"development": {<br />
"host": "127.0.0.1",<br />
"port": 6379,<br />
"db": 13<br />
},<br />
"test": {<br />
"host": "127.0.0.1",<br />
"port": 6379,<br />
"db": 14<br />
},<br />
"production": {<br />
"host":"127.0.0.1",<br />
"port":6379,<br />
"db": 1<br />
}<br />
}<br />
<br />
* Now we may need to fix an issue with Passenger forking after the Redis connection has been made. Please see https://github.com/ArchiveTeam/universal-tracker/issues/5 for more information.<br />
* There is also an issue with non-ASCII names. See https://github.com/ArchiveTeam/universal-tracker/issues/7.<br />
<br />
Now install the necessary gems:<br />
<br />
cd universal-tracker<br />
bundle install<br />
<br />
Log out of the tracker account at this point.<br />
<br />
=== Node.js ===<br />
<br />
Node.js is required to run the fancy leaderboard using WebSockets. We'll use NPM to manage the Node.js libraries:<br />
<br />
sudo apt-get install npm<br />
<br />
Log into the tracker account.<br />
<br />
Now we manually edit the Node.js program, because it expects configuration that we have to fill in by hand:<br />
<br />
cp -R universal-tracker/broadcaster .<br />
nano broadcaster/server.js<br />
<br />
Modify <code>env</code> and <code>trackerConfig</code> variables to something like this:<br />
<br />
var env = {<br />
tracker_config: {<br />
redis_pubsub_channel: "tracker-log"<br />
},<br />
redis_db: 1<br />
};<br />
var trackerConfig = env['tracker_config'];<br />
<br />
You also need to modify the "transports" configuration by adding <code>websocket</code>. The new line should look like this:<br />
<br />
io.set("transports", ["websocket", "xhr-polling"]);<br />
<br />
Install the Node.js libraries needed:<br />
<br />
npm install socket.io<br />
npm install redis<br />
<br />
Log out of the tracker account at this point.<br />
<br />
Create an Upstart file at <code>/etc/init/nodejs-tracker.conf</code>:<br />
<br />
description "tracker nodejs daemon"<br />
<br />
start on runlevel [2]<br />
stop on runlevel [016]<br />
<br />
setuid tracker<br />
setgid tracker<br />
<br />
exec node /home/tracker/broadcaster/server.js<br />
<br />
=== Tracker Setup ===<br />
<br />
Start up the Tracker and Broadcaster:<br />
<br />
sudo start nginx-tracker<br />
sudo start nodejs-tracker<br />
<br />
You now need to configure the tracker. Open up your web browser and visit http://localhost/global-admin/.<br />
<br />
* In Global-Admin→Configuration→Live logging host, specify the public location of the Node.js app. By default, it uses port 8080.<br />
<br />
You are now free to manage the tracker.<br />
<br />
Notes:<br />
<br />
* If you followed this guide, the rsync location is defined as <code>rsync://HOSTNAME/PROJECT_NAME/:downloader/</code><br />
* The '''''trailing slash''''' within the rsync URL is very important. Without it, files will not be uploaded within the directory.<br />
<br />
==== Claims ====<br />
<br />
You probably want Cron to clear out old claims. The Tracker includes a Ruby script that will do that for you. By default, it removes claims older than 6 hours. You may want to change that for big items by creating a copy of the script for each project.<br />
<br />
To set up Cron, log in as the tracker account and run:<br />
<br />
which ruby<br />
<br />
Take note of which Ruby executable is used.<br />
<br />
Now edit the Cron table:<br />
<br />
crontab -e<br />
<br />
Add the following line which runs <code>release-stale.rb</code> every 6 hours:<br />
<br />
0 */6 * * * cd /home/tracker/universal-tracker && WHICH_RUBY scripts/release-stale.rb PROJECT_NAME<br />
<br />
==== Logs ====<br />
<br />
Since the Tracker stores logs in Redis, it will use up memory quickly. <code>log-drainer.rb</code> continuously writes the logs into a text file:<br />
<br />
mkdir -p /home/tracker/universal-tracker/logs/<br />
cd /home/tracker/universal-tracker && ruby scripts/log-drainer.rb<br />
<br />
Pressing CTRL+C will stop it. Run this within a Screen session.<br />
<br />
This crontab entry will compress the log files that haven't been modified in two days:<br />
<br />
@daily find /home/tracker/universal-tracker/logs/ -iname "*.log" -mtime +2 -exec xz {} \;<br />
<br />
==== Reducing memory usage ====<br />
<br />
The Passenger Ruby module may use up too much memory. You can add the following lines to your nginx config. Add these inside the <code>http</code> block:<br />
<br />
passenger_max_pool_size 2;<br />
passenger_max_requests 10000;<br />
<br />
The first line allows spawning a maximum of 2 processes. The second line restarts Passenger after 10,000 requests to reclaim memory lost to leaks.<br />
<br />
{{devnav}}</div>
Yipdw
https://wiki.archiveteam.org/index.php?title=Dev/Tracker&diff=22888
Dev/Tracker
2015-04-18T22:35:14Z
<p>Yipdw: Rails isn't needed; Bundler name is "bundler", not "bundle"</p>
<hr />
<div>This article describes how to set up your own '''[[tracker]]''' just like the official Archive Team tracker. Use this guide only if you want to do a full test of the infrastructure.<br />
<br />
'''Note:''' A virtual machine appliance is available at [https://github.com/ArchiveTeam/archiveteam-dev-env ArchiveTeam/archiveteam-dev-env] which contains a ready-to-use tracker.<br />
<br />
Installation will cover:<br />
<br />
* Environment: Ubuntu/Debian<br />
* Languages:<br />
** Python<br />
** Ruby<br />
** JavaScript<br />
* Web: <br />
** Nginx<br />
** Phusion Passenger<br />
** Redis<br />
** Node.js<br />
* Tools:<br />
** Screen<br />
** Rsync<br />
** Git<br />
** Wget<br />
** Regular expressions<br />
<br />
== The Tracker ==<br />
<br />
The Tracker manages which items are claimed by users running the Seesaw client. It also shows a pretty leaderboard.<br />
<br />
Let's create a dedicated account to run the web server and tracker:<br />
<br />
sudo adduser --system --group --shell /bin/bash tracker<br />
<br />
=== Redis ===<br />
<br />
Redis is a database stored in memory, so item names should be engineered to be memory-efficient. Redis saves its database periodically into a file located at <code>/var/lib/redis/6379/dump.rdb</code>. It is safe to copy that file, e.g., for backups.<br />
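<br />
A backup can therefore be as simple as asking Redis for a fresh snapshot and copying the dump file (a sketch, assuming <code>redis-cli</code> is on the PATH and a /backup directory exists):<br />
<br />
 # ask Redis to write a snapshot in the background<br />
 redis-cli bgsave<br />
 # crude wait for the save to finish (compare timestamps from "redis-cli lastsave" for a robust check)<br />
 sleep 5<br />
 cp /var/lib/redis/6379/dump.rdb /backup/dump-$(date +%F).rdb<br />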
<br />
To install Redis, you may follow these [http://redis.io/topics/quickstart quickstart instructions], but we'll walk through the steps here.<br />
<br />
These steps are from the quickstart guide:<br />
<br />
wget http://download.redis.io/redis-stable.tar.gz<br />
tar xvzf redis-stable.tar.gz<br />
cd redis-stable<br />
make<br />
<br />
Now install the server:<br />
<br />
sudo make install<br />
cd utils<br />
sudo ./install_server.sh<br />
<br />
Note that, by default, it runs as root. Let's stop it and make it run under www-data:<br />
<br />
sudo invoke-rc.d redis_6379 stop<br />
sudo adduser --system --group www-data<br />
sudo chown -R www-data:www-data /var/lib/redis/6379/<br />
sudo chown -R www-data:www-data /var/log/redis_6379.log<br />
<br />
Edit the config file <code>/etc/redis/6379.conf</code> to set options like:<br />
<br />
bind 127.0.0.1<br />
pidfile /var/run/shm/redis_6379.pid<br />
<br />
Now tell the startup script to run it as www-data:<br />
<br />
sudo nano /etc/init.d/redis_6379<br />
<br />
Change the EXEC and CLIEXEC variables to use <code>sudo -u www-data -g www-data</code>:<br />
<br />
EXEC="sudo -u www-data -g www-data /usr/local/bin/redis-server"<br />
CLIEXEC="sudo -u www-data -g www-data /usr/local/bin/redis-cli"<br />
PIDFILE=/var/run/shm/redis_6379.pid<br />
<br />
To avoid catastrophe with background saves failing on <code>fork()</code> (Redis needs lots of memory), run:<br />
<br />
sudo sysctl vm.overcommit_memory=1<br />
<br />
The above setting will be lost after reboot. Add this line to <code>/etc/sysctl.conf</code>:<br />
<br />
vm.overcommit_memory=1<br />
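<br />
You can confirm the setting is in effect with:<br />
<br />
 sysctl vm.overcommit_memory   # should print: vm.overcommit_memory = 1<br />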
<br />
The log file will get big, so we need a logrotate config. Create one at <code>/etc/logrotate.d/redis</code> with this configuration:<br />
<br />
/var/log/redis_*.log {<br />
daily<br />
rotate 10<br />
copytruncate<br />
delaycompress<br />
compress<br />
notifempty<br />
missingok<br />
size 10M<br />
}<br />
<br />
Start up Redis again using:<br />
<br />
sudo invoke-rc.d redis_6379 start<br />
<br />
=== Nginx with Passenger ===<br />
<br />
Nginx is a web server. Phusion Passenger is an Nginx module that runs Ruby web applications.<br />
<br />
There is a [https://www.digitalocean.com/community/articles/how-to-install-rails-and-nginx-with-passenger-on-ubuntu guide] on how to install Nginx with Passenger; the following instructions are similar.<br />
<br />
Log in as tracker:<br />
<br />
sudo -u tracker -i<br />
<br />
We'll use RVM to install Ruby libraries:<br />
<br />
curl -L get.rvm.io | bash -s stable<br />
source ~/.rvm/scripts/rvm<br />
rvm requirements<br />
<br />
A list of packages that need to be installed will be shown. Log out of the tracker account, install them, and log back in.<br />
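<br />
The exact list varies by Ubuntu release; it is typically compiler tooling and development headers, something like the following (a hypothetical example - install what <code>rvm requirements</code> actually lists):<br />
<br />
 sudo apt-get install build-essential libssl-dev libreadline-dev zlib1g-dev libyaml-dev<br />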
<br />
Install Ruby and Bundler:<br />
<br />
rvm install 2.0<br />
rvm rubygems current<br />
gem install bundler<br />
<br />
Install Passenger:<br />
<br />
gem install passenger<br />
<br />
Install Nginx. This command will download, compile, and install a basic Nginx server:<br />
<br />
passenger-install-nginx-module<br />
<br />
Use the following prefix for Nginx installation:<br />
<br />
/home/tracker/nginx/<br />
<br />
Point Nginx at the location of the tracker software (to be installed later): edit <code>nginx/conf/nginx.conf</code> and use the following lines under the "location /" option:<br />
<br />
root /home/tracker/universal-tracker/public;<br />
passenger_enabled on;<br />
client_max_body_size 15M;<br />
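<br />
After editing, you can check that the file still parses by running Nginx's config test (the <code>-t</code> flag):<br />
<br />
 /home/tracker/nginx/sbin/nginx -t -c /home/tracker/nginx/conf/nginx.conf<br />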
<br />
The logs will get big, so we'll use logrotate. Save this into <code>/home/tracker/logrotate.conf</code>:<br />
<br />
/home/tracker/nginx/logs/error.log<br />
/home/tracker/nginx/logs/access.log {<br />
daily<br />
rotate 10<br />
copytruncate<br />
delaycompress<br />
compress<br />
notifempty<br />
missingok<br />
size 10M<br />
}<br />
<br />
To call logrotate, we'll add an entry using crontab:<br />
<br />
crontab -e<br />
<br />
Now add the following line:<br />
<br />
@daily /usr/sbin/logrotate --state /home/tracker/.logrotate.state /home/tracker/logrotate.conf<br />
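<br />
To verify the logrotate setup without rotating anything, you can do a dry run (<code>--debug</code> prints what would happen):<br />
<br />
 /usr/sbin/logrotate --debug --state /home/tracker/.logrotate.state /home/tracker/logrotate.conf<br />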
<br />
Log out of the tracker account at this point.<br />
<br />
Let's create an Upstart configuration file to start up Nginx. Save this into <code>/etc/init/nginx-tracker.conf</code>:<br />
<br />
description "nginx http daemon"<br />
<br />
start on runlevel [2]<br />
stop on runlevel [016]<br />
<br />
setuid tracker<br />
setgid tracker<br />
<br />
console output<br />
<br />
exec /home/tracker/nginx/sbin/nginx -c /home/tracker/nginx/conf/nginx.conf -g "daemon off;"<br />
<br />
=== Tracker ===<br />
<br />
Log in to the tracker account.<br />
<br />
Download the Tracker software:<br />
<br />
git clone https://github.com/ArchiveTeam/universal-tracker.git<br />
<br />
We'll need to configure the location of Redis. Copy the config file:<br />
<br />
cp universal-tracker/config/redis.json.example universal-tracker/config/redis.json<br />
<br />
Add a "production" object into the JSON file. Here is an example:<br />
<br />
{<br />
"development": {<br />
"host": "127.0.0.1",<br />
"port": 6379,<br />
"db": 13<br />
},<br />
"test": {<br />
"host": "127.0.0.1",<br />
"port": 6379,<br />
"db": 14<br />
},<br />
"production": {<br />
"host":"127.0.0.1",<br />
"port":6379,<br />
"db": 1<br />
}<br />
}<br />
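<br />
With the "production" settings above, you can confirm that the tracker's Redis database is reachable (a quick sanity check):<br />
<br />
 redis-cli -h 127.0.0.1 -p 6379 -n 1 ping   # should answer PONG<br />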
<br />
* Now we may need to fix an issue with Passenger forking after the Redis connection has been made. Please see https://github.com/ArchiveTeam/universal-tracker/issues/5 for more information.<br />
* There is also an issue with non-ASCII names. See https://github.com/ArchiveTeam/universal-tracker/issues/7.<br />
<br />
Now install the necessary gems:<br />
<br />
cd universal-tracker<br />
bundle install<br />
<br />
Log out of the tracker account at this point.<br />
<br />
=== Node.js ===<br />
<br />
Node.js is required to run the fancy leaderboard using WebSockets. We'll use NPM to manage the Node.js libraries:<br />
<br />
sudo apt-get install npm<br />
<br />
Log into the tracker account.<br />
<br />
Now, we manually edit the Node.js program because it has problems loading its configuration:<br />
<br />
cp -R universal-tracker/broadcaster .<br />
nano broadcaster/server.js<br />
<br />
Modify <code>env</code> and <code>trackerConfig</code> variables to something like this:<br />
<br />
var env = {<br />
tracker_config: {<br />
redis_pubsub_channel: "tracker-log"<br />
},<br />
redis_db: 1<br />
};<br />
var trackerConfig = env['tracker_config'];<br />
<br />
You also need to modify the "transports" configuration by adding <code>websocket</code>. The new line should look like this:<br />
<br />
io.set("transports", ["websocket", "xhr-polling"]);<br />
<br />
Install the needed Node.js libraries:<br />
<br />
npm install socket.io<br />
npm install redis<br />
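<br />
Before wiring the broadcaster into Upstart below, you may want to run it once in the foreground to confirm it starts cleanly (a sketch, assuming the default port 8080 mentioned later in this guide):<br />
<br />
 node broadcaster/server.js &<br />
 sleep 2<br />
 curl -sI http://localhost:8080/ | head -n 1   # any HTTP status line means it is listening<br />
 kill $!<br />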
<br />
Log out of the tracker account at this point.<br />
<br />
Create an Upstart file at <code>/etc/init/nodejs-tracker.conf</code>:<br />
<br />
description "tracker nodejs daemon"<br />
<br />
start on runlevel [2]<br />
stop on runlevel [016]<br />
<br />
setuid tracker<br />
setgid tracker<br />
<br />
exec node /home/tracker/broadcaster/server.js<br />
<br />
=== Tracker Setup ===<br />
<br />
Start up the Tracker and Broadcaster:<br />
<br />
sudo start nginx-tracker<br />
sudo start nodejs-tracker<br />
<br />
You now need to configure the tracker. Open up your web browser and visit http://localhost/global-admin/.<br />
<br />
* In Global-Admin→Configuration→Live logging host, specify the public location of the Node.js app. By default, it uses port 8080.<br />
<br />
You are now free to manage the tracker.<br />
<br />
Notes:<br />
<br />
* If you followed this guide, the rsync upload location is defined as <code>rsync://HOSTNAME/PROJECT_NAME/:downloader/</code><br />
* The '''''trailing slash''''' in the rsync URL is very important. Without it, files will not be uploaded into the directory.<br />
<br />
==== Claims ====<br />
<br />
You probably want Cron to clear out old claims. The Tracker includes a Ruby script that will do that for you. By default, it removes claims older than 6 hours. For projects with big items, you may want to change that interval by creating a copy of the script for each project.<br />
<br />
To set up Cron, log in as the tracker account and run:<br />
<br />
which ruby<br />
<br />
Take note of which Ruby executable is used.<br />
<br />
Now edit the Cron table:<br />
<br />
crontab -e<br />
<br />
Add the following line which runs <code>release-stale.rb</code> every 6 hours:<br />
<br />
0 */6 * * * cd /home/tracker/universal-tracker && WHICH_RUBY scripts/release-stale.rb PROJECT_NAME<br />
<br />
==== Logs ====<br />
<br />
Since the Tracker stores logs in Redis, it will use up memory quickly. <code>log-drainer.rb</code> continuously writes the logs into a text file:<br />
<br />
mkdir -p /home/tracker/universal-tracker/logs/<br />
cd /home/tracker/universal-tracker && ruby scripts/log-drainer.rb<br />
<br />
Pressing CTRL+C will stop it. Run this within a Screen session.<br />
<br />
This crontab entry will compress the log files that haven't been modified in two days:<br />
<br />
@daily find /home/tracker/universal-tracker/logs/ -iname "*.log" -mtime +2 -exec xz {} \;<br />
<br />
==== Reducing memory usage ====<br />
<br />
The Passenger Ruby module may use too much memory. To limit it, add the following lines inside the <code>http</code> block of your nginx config:<br />
<br />
passenger_max_pool_size 2;<br />
passenger_max_requests 10000;<br />
<br />
The first line allows spawning a maximum of 2 processes. The second line restarts Passenger after 10,000 requests to reclaim memory lost to leaks.<br />
<br />
{{devnav}}</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation&diff=22142INTERNETARCHIVE.BAK/git-annex implementation2015-03-04T23:52:00Z<p>Yipdw: </p>
<hr />
<div>This page addresses a [https://git-annex.branchable.com git-annex] implementation of [[INTERNETARCHIVE.BAK]].<br />
<br />
For more information, see http://git-annex.branchable.com/design/iabackup/.<br />
<br />
= Some quick info on Internet Archive =<br />
<br />
== Data model ==<br />
<br />
IA's data is organized into ''collections'' and ''items''. One collection contains many items. An item contains files of the same type, such as multiple MP3 files in an album or a single ISO image file. (A PDF manual and its software should go in separate items.)<br />
<br />
Here's an example collection and item in that collection: https://archive.org/details/archiveteam-fire, https://archive.org/details/proust-panic-download-warc.<br />
<br />
== Browsing the Internet Archive ==<br />
<br />
In addition to the web interface, you can use the [https://pypi.python.org/pypi/internetarchive Internet Archive command-line tool]. The tool currently requires a Python 2.x installation. If you use pip, run<br />
<br />
<pre><br />
pip install internetarchive<br />
</pre><br />
<br />
See https://pypi.python.org/pypi/internetarchive#command-line-usage for usage information. If you want to start digging, you might find it useful to issue <code>ia search 'collection:*'</code>; this'll return summary information for all of IA's items.<br />
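<br />
For instance, using the example collection and item mentioned above (a sketch - check <code>ia help</code> for the current syntax):<br />
<br />
<pre><br />
# list items in a collection<br />
ia search 'collection:archiveteam-fire'<br />
# show an item's metadata as JSON<br />
ia metadata proust-panic-download-warc<br />
# download only an item's WARC files<br />
ia download proust-panic-download-warc --glob='*.warc.gz'<br />
</pre><br />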
<br />
= First tasks =<br />
<br />
<pre><br />
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:<br />
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB<br />
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?<br />
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW<br />
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)<br />
<closure> - client runtime environment (docker image maybe?) with warrior-like interface<br />
<closure> (all that needs to do is configure things and get git-annex running)<br />
<closure> could someone wiki that? ta<br />
</pre></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation&diff=22140INTERNETARCHIVE.BAK/git-annex implementation2015-03-04T23:41:19Z<p>Yipdw: /* Browsing the Internet Archive */</p>
<hr />
<div>This page addresses a [https://git-annex.branchable.com git-annex] implementation of [[INTERNETARCHIVE.BAK]].<br />
<br />
For more information, see http://git-annex.branchable.com/design/iabackup/.<br />
<br />
= Some quick info on Internet Archive =<br />
<br />
== Data model ==<br />
<br />
IA's data is organized into ''collections'' and ''items''. One collection contains many items.<br />
<br />
Here's an example collection and item in that collection: https://archive.org/details/archiveteam-fire, https://archive.org/details/proust-panic-download-warc.<br />
<br />
== Browsing the Internet Archive ==<br />
<br />
In addition to the web interface, you can use the [https://pypi.python.org/pypi/internetarchive Internet Archive command-line tool]. The tool currently requires a Python 2.x installation. If you use pip, run<br />
<br />
<pre><br />
pip install internetarchive<br />
</pre><br />
<br />
From there, you can run <code>ia search 'collection:*'</code> to get information on all collections as a JSON array. (It's a big list.) See https://pypi.python.org/pypi/internetarchive#command-line-usage for more information.<br />
<br />
= First tasks =<br />
<br />
<pre><br />
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:<br />
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB<br />
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?<br />
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW<br />
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)<br />
<closure> - client runtime environment (docker image maybe?) with warrior-like interface<br />
<closure> (all that needs to do is configure things and get git-annex running)<br />
<closure> could someone wiki that? ta<br />
</pre></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation&diff=22139INTERNETARCHIVE.BAK/git-annex implementation2015-03-04T23:40:56Z<p>Yipdw: </p>
<hr />
<div>This page addresses a [https://git-annex.branchable.com git-annex] implementation of [[INTERNETARCHIVE.BAK]].<br />
<br />
For more information, see http://git-annex.branchable.com/design/iabackup/.<br />
<br />
= Some quick info on Internet Archive =<br />
<br />
== Data model ==<br />
<br />
IA's data is organized into ''collections'' and ''items''. One collection contains many items.<br />
<br />
Here's an example collection and item in that collection: https://archive.org/details/archiveteam-fire, https://archive.org/details/proust-panic-download-warc.<br />
<br />
== Browsing the Internet Archive ==<br />
<br />
In addition to the web interface, you can use the [https://pypi.python.org/pypi/internetarchive Internet Archive command-line tool]. The tool currently requires a Python 2.x installation. If you use pip, run<br />
<br />
<pre><br />
pip install internetarchive<br />
</pre><br />
<br />
From there, you can run `ia search 'collection:*'` to get information on all collections as a JSON array. (It's a big list.) See https://pypi.python.org/pypi/internetarchive#command-line-usage for more information.<br />
<br />
= First tasks =<br />
<br />
<pre><br />
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:<br />
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB<br />
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?<br />
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW<br />
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)<br />
<closure> - client runtime environment (docker image maybe?) with warrior-like interface<br />
<closure> (all that needs to do is configure things and get git-annex running)<br />
<closure> could someone wiki that? ta<br />
</pre></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation&diff=22138INTERNETARCHIVE.BAK/git-annex implementation2015-03-04T23:36:44Z<p>Yipdw: /* Internet Archive's structure */</p>
<hr />
<div>This page addresses a [https://git-annex.branchable.com git-annex] implementation of [[INTERNETARCHIVE.BAK]].<br />
<br />
For more information, see http://git-annex.branchable.com/design/iabackup/.<br />
<br />
= Internet Archive's structure =<br />
<br />
IA's data is organized into ''collections'' and ''items''. One collection contains many items.<br />
<br />
Here's an example collection and item in that collection: https://archive.org/details/archiveteam-fire, https://archive.org/details/proust-panic-download-warc.<br />
<br />
= First tasks =<br />
<br />
<pre><br />
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:<br />
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB<br />
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?<br />
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW<br />
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)<br />
<closure> - client runtime environment (docker image maybe?) with warrior-like interface<br />
<closure> (all that needs to do is configure things and get git-annex running)<br />
<closure> could someone wiki that? ta<br />
</pre></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation&diff=22137INTERNETARCHIVE.BAK/git-annex implementation2015-03-04T23:35:51Z<p>Yipdw: </p>
<hr />
<div>This page addresses a [https://git-annex.branchable.com git-annex] implementation of [[INTERNETARCHIVE.BAK]].<br />
<br />
For more information, see http://git-annex.branchable.com/design/iabackup/.<br />
<br />
= Internet Archive's structure =<br />
<br />
IA's data is organized into _collections_ and _items_; one collection contains many items.<br />
<br />
Here's an example collection: https://archive.org/details/archiveteam-fire<br />
...and here's an item in that collection: https://archive.org/details/proust-panic-download-warc<br />
<br />
= First tasks =<br />
<br />
<pre><br />
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:<br />
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB<br />
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?<br />
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW<br />
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)<br />
<closure> - client runtime environment (docker image maybe?) with warrior-like interface<br />
<closure> (all that needs to do is configure things and get git-annex running)<br />
<closure> could someone wiki that? ta<br />
</pre></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK&diff=22133INTERNETARCHIVE.BAK2015-03-04T18:42:21Z<p>Yipdw: </p>
<hr />
<div>__NOTOC__<br />
<br />
''For current implementation details, see [[INTERNETARCHIVE.BAK/git-annex_implementation]].''<br />
<br />
== INTERNETARCHIVE.BAK ==<br />
<br />
The wonder of the [[Internet Archive]]'s petabytes of stored data is that they're a world treasure, providing access to a mass of information and stored culture, gathered from decades of history (in some cases, centuries), and available in a (relatively) easy-to-find fashion. And as media and popular websites begin to cover the Archive's mission in earnest, the audience is growing notably. <br />
<br />
In each wave of interest, two questions come forward out of the accolades: What is the disaster recovery plan? And why is it in only one place?<br />
<br />
The disaster recovery plan varies, but generally relies on multiple locations for the data; the data is in one place because the fundraising and support methods so far can only provide for a certain amount of disaster/backup planning.<br />
<br />
Therefore, it is time for Archive Team to launch its most audacious project yet: Backing up the Internet Archive.<br />
<br />
== WELL THAT IS SERIOUSLY FUCKING IMPOSSIBLE ==<br />
<br />
That is a very natural and understandable reaction. Before we go further, let us quickly cover some facts about the Archive's datastores.<br />
<br />
* Internet Archive has roughly 21 petabytes of unique data at this juncture. (It grows daily.)<br />
* Some of that is more critical and disaster-worrisome than others. (Web crawls versus TV.)<br />
* Some of that is disconnected/redundant data.<br />
* A lot of it, the vast, vast majority, is extremely static, and meant to live forever.<br />
* A lot of it is "derives", and is marked as such - files that were derived from other files.<br />
<br />
Obviously, the numbers need to be run, but it's less than 20 petabytes, and ultimately, ''20 petabytes isn't that much''.<br />
<br />
Ultimately, 21 petabytes is 42,000 500gb chunks. 500gb drives are not expensive. People are throwing a lot of them out. And, obviously, we have drives of 1tb, 2tb, even 6tb, each of which holds multiple "chunks".<br />
<br />
The vision I have is this: a reversal, of sorts, of [[ArchiveBot]] - a service that allows people to provide hard drive space and hard drives, such that they can volunteer to be part of a massive Internet Archive "virtual drive" that will hold multiple (yes, multiple) copies of the Internet Archive. <br />
<br />
== THE PHILOSOPHY ==<br />
<br />
There is an effort called LOCKSS (Lots of Copies Keeps Stuff Safe) at Stanford [http://www.lockss.org/] which is meant to provide as many tools and opportunities to save digital data as easily and quickly as possible. At Google, I've been told, they try for at least five copies of data stored in at least three physical locations. This is meant to provide a similar situation for the Internet Archive.<br />
<br />
While this is kind of nutty and can be considered a strange ad-hoc situation, I believe that, given the opportunity to play a part in non-geographically-dependent copies of the Internet Archive, many folks will step forward, and we will have a good solution until the expensive "good" solution comes along. Also, it is a very nice statement of support.<br />
<br />
== THE IMPLEMENTATION ==<br />
<br />
In this, there is an ArchiveBot-like service that keeps track of "The Drive", the backup of the Internet Archive. This "Drive" has sectors of a certain size, preferably 500gb, though smaller sizes might make sense. The "items" being backed up are Internet Archive items, with the derivations not included (so it will keep the uploaded .wav file but not the derived .mp3 and .ogg files). These "sectors" are then checked into the virtual drive, and based on whether there are zero, one, two, or more than two copies of an item in "The Drive", a color-coding is assigned (Red, Yellow, Green). <br />
<br />
For an end user, this includes two basic functions: copy and verification.<br />
<br />
In copy, you shove your drive into a USB dock or point to some filespace on your RAID or maybe even some space on your laptop's hard drive and say "I contribute this to The Drive". Then the client will back up your assigned items onto the drive. It will do so in a manner that maintains data integrity, but allows the files on your local drive or directory to remain accessible (they should not be encrypted, in other words). Once it's done and verified, it is checked into The Drive as a known copy.<br />
<br />
In verification, you will need to run the client every so often - after a while, say three months, your copy will be considered out of date by The Drive. If you do not check in after, say, six months, your copy will be considered stale and forgotten, and The Drive will lose a copy. (This is part of why you want at least two copies out there.)<br />
<br />
Copy and Verification are end-user experiences, so they will initially have bumps, but over time, the Drive will have copies everywhere, around the world, in laptops and stacks of drives in closets, and in the case of what will be considered "high availability" items, the number of copies could be in the dozens or hundreds, ensuring fast return if a disaster hits.<br />
<br />
== CONCERNS ==<br />
<br />
More people can add concerns, but my main one is preparing against Bad Actors, where someone might mess with their copy of the sector of The Drive that they have. Protections and checks will have to be put in to make sure the given backups are in good shape. There will always be continued risk, however, and hence the "high availability" items where there will be lots of copies to "vote". NOTE: Lots of thoughts on bad actors are on the discussion page.<br />
<br />
There is also a thought about recovery - we want to be able to have the data pulled back, and that will mean a recovery system of some sort.<br />
<br />
== STEPS TOWARDS IMPLEMENTATION ==<br />
<br />
I'd like to see us try some prototypes, with a given item set that is limited to, say, 100gb. <br />
<br />
There is now a channel, '''#internetarchive.bak''', on EFNet, for discussion about the implementations and testing. Of course, the discussion tab of this page is where in-process tests can be put, so people do not re-do investigations ("Hey, what about...") to the point of fatigue.</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK&diff=22132INTERNETARCHIVE.BAK2015-03-04T18:42:07Z<p>Yipdw: </p>
<hr />
<div>__NOTOC__<br />
<br />
** For current implementation details, see [[INTERNETARCHIVE.BAK/git-annex_implementation]]. **<br />
<br />
== INTERNETARCHIVE.BAK ==<br />
<br />
The wonder of the [[Internet Archive]]'s petabytes of stored data is that they're a world treasure, providing access to a mass of information and stored culture, gathered from decades of history (in some cases, centuries), and available in a (relatively) easy-to-find fashion. And as media and popular websites begin to cover the Archive's mission in earnest, the audience is growing notably. <br />
<br />
In each wave of interest, two questions come forward out of the accolades: What is the disaster recovery plan? And why is it in only one place?<br />
<br />
The disaster recovery plan varies, but generally relies on multiple locations for the data; the data is in one place because the fundraising and support methods so far can only provide for a certain amount of disaster/backup planning.<br />
<br />
Therefore, it is time for Archive Team to launch its most audacious project yet: Backing up the Internet Archive.<br />
<br />
== WELL THAT IS SERIOUSLY FUCKING IMPOSSIBLE ==<br />
<br />
That is a very natural and understandable reaction. Before we go further, let us quickly cover some facts about the Archive's datastores.<br />
<br />
* Internet Archive has roughly 21 petabytes of unique data at this juncture. (It grows daily.)<br />
* Some of that is more critical and disaster-worrisome than others. (Web crawls versus TV.)<br />
* Some of that is disconnected/redundant data.<br />
* A lot of it, the vast, vast majority, is extremely static, and meant to live forever.<br />
* A lot of it is "derives", and is marked as such - files that were derived from other files.<br />
<br />
Obviously, the numbers need to be run, but it's less than 20 petabytes, and ultimately, ''20 petabytes isn't that much''.<br />
<br />
Ultimately, 21 petabytes is 42,000 500gb chunks. 500gb drives are not expensive. People are throwing a lot of them out. And, obviously, we have drives of 1tb, 2tb, even 6tb, each of which holds multiple "chunks".<br />
<br />
The vision I have is this: a reversal, of sorts, of [[ArchiveBot]] - a service that allows people to provide hard drive space and hard drives, such that they can volunteer to be part of a massive Internet Archive "virtual drive" that will hold multiple (yes, multiple) copies of the Internet Archive. <br />
<br />
== THE PHILOSOPHY ==<br />
<br />
There is an effort called LOCKSS (Lots of Copies Keeps Stuff Safe) at Stanford [http://www.lockss.org/] which is meant to provide as many tools and opportunities to save digital data as easily and quickly as possible. At Google, I've been told, they try for at least five copies of data stored in at least three physical locations. This is meant to provide a similar situation for the Internet Archive.<br />
<br />
While this is kind of nutty and can be considered a strange ad-hoc situation, I believe that, given the opportunity to play a part in non-geographically-dependent copies of the Internet Archive, many folks will step forward, and we will have a good solution until the expensive "good" solution comes along. Also, it is a very nice statement of support.<br />
<br />
== THE IMPLEMENTATION ==<br />
<br />
In this, there is an ArchiveBot-like service that keeps track of "The Drive", the backup of the Internet Archive. This "Drive" has sectors of a certain size, preferably 500gb, though smaller sizes might make sense. The "items" being backed up are Internet Archive items, with the derivations not included (so it will keep the uploaded .wav file but not the derived .mp3 and .ogg files). These "sectors" are then checked into the virtual drive, and based on whether there are zero, one, two, or more than two copies of an item in "The Drive", a color-coding is assigned (Red, Yellow, Green). <br />
<br />
For an end user, this includes two basic functions: copy and verification.<br />
<br />
In copy, you shove your drive into a USB dock or point to some filespace on your RAID or maybe even some space on your laptop's hard drive and say "I contribute this to The Drive". Then the client will back up your assigned items onto the drive. It will do so in a manner that maintains data integrity, but allows the files on your local drive or directory to remain accessible (they should not be encrypted, in other words). Once it's done and verified, it is checked into The Drive as a known copy.<br />
<br />
In verification, you will need to run the client every so often - after a while, say three months, your copy will be considered out of date by The Drive. If you do not check in after, say, six months, your copy will be considered stale and forgotten, and The Drive will lose a copy. (This is part of why you want at least two copies out there.)<br />
<br />
Copy and Verification are end-user experiences, so they will initially have bumps, but over time, the Drive will have copies everywhere, around the world, in laptops and stacks of drives in closets, and in the case of what will be considered "high availability" items, the number of copies could be in the dozens or hundreds, ensuring fast return if a disaster hits.<br />
<br />
== CONCERNS ==<br />
<br />
More people can add concerns, but my main one is preparing against Bad Actors, where someone might mess with their copy of the sector of The Drive that they have. Protections and checks will have to be put in to make sure the given backups are in good shape. There will always be continued risk, however, and hence the "high availability" items where there will be lots of copies to "vote". NOTE: Lots of thoughts on bad actors are on the discussion page.<br />
<br />
There is also a thought about recovery - we want to be able to have the data pulled back, and that will mean a recovery system of some sort.<br />
<br />
== STEPS TOWARDS IMPLEMENTATION ==<br />
<br />
I'd like to see us try some prototypes, with a given item set that is limited to, say, 100gb. <br />
<br />
There is now a channel, '''#internetarchive.bak''', on EFNet, for discussion about the implementations and testing. Of course, the discussion tab of this page is where in-process tests can be put, so people do not re-do investigations ("Hey, what about...") to the point of fatigue.</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation&diff=22131INTERNETARCHIVE.BAK/git-annex implementation2015-03-04T18:41:39Z<p>Yipdw: </p>
<hr />
<div>This page addresses a [https://git-annex.branchable.com git-annex] implementation of [[INTERNETARCHIVE.BAK]].<br />
<br />
= First tasks =<br />
<br />
<pre><br />
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:<br />
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB<br />
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?<br />
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW<br />
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)<br />
<closure> - client runtime environment (docker image maybe?) with warrior-like interface<br />
<closure> (all that needs to do is configure things and get git-annex running)<br />
<closure> could someone wiki that? ta<br />
</pre></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation&diff=22130INTERNETARCHIVE.BAK/git-annex implementation2015-03-04T18:41:13Z<p>Yipdw: </p>
<hr />
<div>This page addresses a [https://git-annex.branchable.com|git-annex] implementation of [[INTERNETARCHIVE.BAK]].<br />
<br />
= First tasks =<br />
<br />
<pre><br />
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:<br />
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB<br />
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?<br />
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW<br />
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)<br />
<closure> - client runtime environment (docker image maybe?) with warrior-like interface<br />
<closure> (all that needs to do is configure things and get git-annex running)<br />
<closure> could someone wiki that? ta<br />
</pre></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation&diff=22129INTERNETARCHIVE.BAK/git-annex implementation2015-03-04T18:41:06Z<p>Yipdw: HOW DO I WIKIMARKUP</p>
<hr />
<div>This page addresses a [[https://git-annex.branchable.com|git-annex]] implementation of [[INTERNETARCHIVE.BAK]].<br />
<br />
= First tasks =<br />
<br />
<pre><br />
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:<br />
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB<br />
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?<br />
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW<br />
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)<br />
<closure> - client runtime environment (docker image maybe?) with warrior-like interface<br />
<closure> (all that needs to do is configure things and get git-annex running)<br />
<closure> could someone wiki that? ta<br />
</pre></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation&diff=22128INTERNETARCHIVE.BAK/git-annex implementation2015-03-04T18:40:54Z<p>Yipdw: </p>
<hr />
<div>This page addresses a [[git-annex|https://git-annex.branchable.com]] implementation of [[INTERNETARCHIVE.BAK]].<br />
<br />
= First tasks =<br />
<br />
<pre><br />
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:<br />
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB<br />
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?<br />
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW<br />
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)<br />
<closure> - client runtime environment (docker image maybe?) with warrior-like interface<br />
<closure> (all that needs to do is configure things and get git-annex running)<br />
<closure> could someone wiki that? ta<br />
</pre></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation&diff=22127INTERNETARCHIVE.BAK/git-annex implementation2015-03-04T18:40:43Z<p>Yipdw: </p>
<hr />
<div>This page addresses a [git-annex|git-annex.branchable.com] implementation of [[INTERNETARCHIVE.BAK]].<br />
<br />
= First tasks =<br />
<br />
<pre><br />
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:<br />
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB<br />
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?<br />
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW<br />
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)<br />
<closure> - client runtime environment (docker image maybe?) with warrior-like interface<br />
<closure> (all that needs to do is configure things and get git-annex running)<br />
<closure> could someone wiki that? ta<br />
</pre></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation&diff=22126INTERNETARCHIVE.BAK/git-annex implementation2015-03-04T18:40:05Z<p>Yipdw: </p>
<hr />
<div>= First tasks =<br />
<br />
<pre><br />
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:<br />
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB<br />
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?<br />
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW<br />
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)<br />
<closure> - client runtime environment (docker image maybe?) with warrior-like interface<br />
<closure> (all that needs to do is configure things and get git-annex running)<br />
<closure> could someone wiki that? ta<br />
</pre></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation&diff=22125INTERNETARCHIVE.BAK/git-annex implementation2015-03-04T18:39:56Z<p>Yipdw: Created page with "h1. First tasks <pre> <closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps ot..."</p>
<hr />
<div>h1. First tasks<br />
<br />
<pre><br />
<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:<br />
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB<br />
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?<br />
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW<br />
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)<br />
<closure> - client runtime environment (docker image maybe?) with warrior-like interface<br />
<closure> (all that needs to do is configure things and get git-annex running)<br />
<closure> could someone wiki that? ta<br />
</pre></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Talk:INTERNETARCHIVE.BAK&diff=22068Talk:INTERNETARCHIVE.BAK2015-03-02T05:42:52Z<p>Yipdw: /* Potential solutions to the storage problem */</p>
<hr />
<div>== A note on the end-user drives ==<br />
<br />
I feel it is really critical that the drives or directories sitting in the end-user's location be absolutely readable, as a file directory, containing the files. Even if that directory is inside a .tar or .zip or .gz file. Making it into an encrypted item should not happen, unless we make a VERY SPECIFIC, and redundant channel of such a thing. --[[User:Jscott|Jscott]] 00:01, 2 March 2015 (EST)<br />
<br />
==Potential solutions to the storage problem==<br />
* [https://tahoe-lafs.org/trac/tahoe-lafs Tahoe-LAFS] - decentralized (mostly), client-side encrypted file storage grid<br />
** Requires central introducer and possibly gateway nodes<br />
** Any storage node could perform a Sybil attack until a feature for client-side storage node choice is added to Tahoe.<br />
* [http://git-annex.branchable.com/ git-annex] - allows tracking copies of files in git without them being stored in a repository<br />
** Also provides a way to know what sources exist for a given item. git-annex is not (AFAIK) locked to any specific storage medium. -- yipdw<br />
<br />
==Other anticipated problems==<br />
* Users tampering with data - how do we know data a user stored has not been modified since it was pulled from IA?<br />
** Proposed solution: have multiple people make their own collection of checksums of IA files. --[[User:Mhazinsk|Mhazinsk]] 00:10, 2 March 2015 (EST)<br />
* "Dark" items (e.g. the "Internet Records" collection)<br />
** There are classifications of items within the Archive that should be considered for later waves, and not this initial effort. That includes dark items, television, and others.</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Talk:INTERNETARCHIVE.BAK&diff=22067Talk:INTERNETARCHIVE.BAK2015-03-02T05:42:39Z<p>Yipdw: /* Potential solutions to the storage problem */</p>
<hr />
<div>== A note on the end-user drives ==<br />
<br />
I feel it is really critical that the drives or directories sitting in the end-user's location be absolutely readable, as a file directory, containing the files. Even if that directory is inside a .tar or .zip or .gz file. Making it into an encrypted item should not happen, unless we make a VERY SPECIFIC, and redundant channel of such a thing. --[[User:Jscott|Jscott]] 00:01, 2 March 2015 (EST)<br />
<br />
==Potential solutions to the storage problem==<br />
* [https://tahoe-lafs.org/trac/tahoe-lafs Tahoe-LAFS] - decentralized (mostly), client-side encrypted file storage grid<br />
** Requires central introducer and possibly gateway nodes<br />
** Any storage node could perform a Sybil attack until a feature for client-side storage node choice is added to Tahoe.<br />
* [http://git-annex.branchable.com/ git-annex] - allows tracking copies of files in git without them being stored in a repository<br />
** Also provides a way to know what sources exist for a given item. git-annex is not (AFAIK) locked to any specific storage medium.<br />
<br />
==Other anticipated problems==<br />
* Users tampering with data - how do we know data a user stored has not been modified since it was pulled from IA?<br />
** Proposed solution: have multiple people make their own collection of checksums of IA files. --[[User:Mhazinsk|Mhazinsk]] 00:10, 2 March 2015 (EST)<br />
* "Dark" items (e.g. the "Internet Records" collection)<br />
** There are classifications of items within the Archive that should be considered for later waves, and not this initial effort. That includes dark items, television, and others.</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Valhalla&diff=20219Valhalla2014-09-21T17:30:33Z<p>Yipdw: </p>
<hr />
<div>[[Image:Ms internet on a disc.jpg|300px|right]]<br />
This wiki page is a collection of ideas for Project '''Valhalla'''.<br />
<br />
This project/discussion has come around because there is a class of data that, several times a year, exists as a massive amount of data with "large, but nominal" status within the Internet Archive. The largest example is currently MobileMe, which is hundreds of terabytes in the Internet Archive system (and in need of WARC conversion), representing a cost far outstripping its use. Another is TwitPic, which is currently available (and might continue to be available) but which has shown itself to be a bad actor with regard to longevity and predictability for its sunset. <br />
<br />
Therefore, there is an argument that there could be a "third place" where data collected by Archive Team could sit, until the Internet Archive (or another entity) grows its coffers/storage enough that 80-100tb is "no big deal", just like 1tb of data was annoying in 2009 and is now totally understandable for the value, e.g. Geocities. <br />
<br />
This is for short-term (or potentially also long-term) storage options, say five years or less, of data generated by Archive Team.<br />
<br />
* What options are out there, generally?<br />
* What are the costs, roughly?<br />
* What are the positives and negatives?<br />
<br />
There has been a lot of study in this area over the years, of course, so links to known authorities and debates will be welcome as well.<br />
<br />
Join the discussion in [irc://irc.efnet.org/huntinggrounds #huntinggrounds].<br />
<br />
== Goals ==<br />
<br />
We want to:<br />
<br />
* Dump an unlimited<ref>Unlimited doesn't mean infinite, but it does mean that we shouldn't worry about running out of space. We won't be the only expanding data store.</ref> amount of data into something.<br />
* Recover that data at any point.<br />
<br />
We do not care about:<br />
<br />
* Immediate or continuous availability.<br />
<br />
We absolutely require:<br />
<br />
* Low (ideally, zero) human time for maintenance. If we have substantial human maintenance needs, we're probably going to need a Committee of Elders or something.<br />
* Data integrity. The storage medium must be impossibly durable or make it inexpensive/easy to copy and verify the data onto a fresh medium.<br />
<br />
It would be nice to have:<br />
<br />
* No special environmental requirements that could not be handled by a third party. (So nobody in Archive Team would have to set up some sort of climate-controlled data-cave; however, if this is already something that e.g. IA does and they are willing to lease space, that's cool.)<br />
<br />
== What does the Internet Archive do for this Situation, Anyway? ==<br />
<br />
''This section has not been cleared by the Internet Archive, and so should be considered a rough sketch.''<br />
<br />
The Internet Archive primarily wants "access" to the data it stores, so the primary storage methodology is spinning hard drives connected to a high-speed connection from multiple locations. These hard drives are 4-6tb (as of 2014) and are of general grade, as is most of the hardware - the theory is that replacing cheap hardware is better than spending a lot of money on super-grade hardware (whatever that may be) and not being able to make the dollars stretch. Hundreds of drives die in a month, and the system's resiliency allows replacements to be hot-swapped in. <br />
<br />
There are multiple warehouses for storing the original books that are scanned, as well as materials like CD-ROMs and even hard drives. There are collections of tapes and CD-ROMs from previous iterations of storage, although they are thought of as drop-dead options instead of long-term archival storage - the preference is, first and foremost, the spinning hard drives.<br />
<br />
The Archive does not generally use tape technology, having run into the classic "whoops, no tape drive on earth reads these any more" and "whoops, this tape no longer works properly".<br />
<br />
The Archive has indicated that if Archive Team uses a physical storage method, such as tapes, paper, hard drives or anything else, that they are willing to store these materials "as long as they are exceedingly labelled".<br />
<br />
== Options ==<br />
{| class="wikitable sortable"<br />
! Storage type<br />
! Cost ($/TB/year)<br />
! Storage density (m³/TB)<br />
! Theoretical lifespan<br />
! Practical, tested lifespan<br />
! Notes<br />
|-<br />
| Hard drives (simple distributed pool)<br />
| $150 (full cost of best reasonable 1TB+ external HD)<br />
| <br />
| <br />
| <br />
| As of September 2014, the best reasonable 1TB+ external HD is [http://thewirecutter.com/reviews/the-best-external-desktop-hard-drive/ a 4TB WD]. 25+ pool members would need one HD each, plus a computer, plus software to distribute data across the entire pool.<br />
|-<br />
| Hard drives (dedicated distributed pool)<br />
| <br />
| <br />
| <br />
| <br />
| An off-the-shelf or otherwise specified, dedicated, network storage device used exclusively as part of a distributed pool.<br />
|-<br />
| Hard drives (SPOF) <ref>The [[Internet Archive]]'s cost per TB, with 24/7 online hard drives, is approximately $2000 for forever.</ref><br />
| $62 (but you have to buy 180TB)<br />
| <br />
| <br />
| <br />
| For a single location to provide all storage needs, building a [https://www.backblaze.com/blog/backblaze-storage-pod-4/ Backblaze Storage Pod 4.0] runs an average of $11,000, providing 180TB of [http://bioteam.net/2011/08/why-you-should-never-build-a-backblaze-pod/ non-redundant, not-highly-available] storage. (You really want more than one pod mirroring your data, but this is the most effective way to get that much storage in one place.)<br />
|-<br />
| Commercial / archival-grade tapes<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Consumer tape systems (VHS, Betamax, cassette tapes, ...)<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Vinyl<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| [http://www.ollydbg.de/Paperbak/index.html PaperBack]<br />
| <br />
| <br />
| <br />
| <br />
| 500KB per letter sheet means 1TB is 2,199,024 sheets, or ~4400 reams (500 sheets each), or an 8'x16' room filled with 6' tall stacks.<br />
|-<br />
| [http://ronja.twibright.com/optar/ Optar]<br />
| <br />
| <br />
| <br />
| <br />
| At 200KB per page, this has less than half the storage density of PaperBack.<br />
|-<br />
| Blu-Ray<br />
| $40 (50 pack spindle of 25GB BD-Rs)<br />
| <br />
| 30 years<ref>On the basis of the described studies and assuming adequate consideration of the specified conditions for storage and handling, as well as verification of data after writing, we estimate the Imation CD, DVD or Blu-ray media to have a theoretical readability of up to 30 years. The primary caveat is how you handle and store the media. http://support.tdkperformance.com/app/answers/detail/a_id/1685/~/life-expectancy-of-optical-media </ref><br />
| <br />
| Lasts a LOT longer than CD/DVD, but should not be assumed to last more than a decade. See [http://arstechnica.com/information-technology/2014/01/why-facebook-thinks-blu-ray-discs-are-perfect-for-the-data-center/ why Facebook thinks Blu-ray discs are perfect for the data center]: raidz3 across backup groups of 15 discs. Comes to under $.04/GB, which is cheap, with a low initial investment (drives) too!<br><br />
<br>Specifically, a 50pack spindle of 25GB BD-Rs could readily hold 1TB of data for $30-50 per spindle. 50GB and 100GB discs are more expensive per GB.<br />
|-<br />
| [http://en.wikipedia.org/wiki/M-DISC M-DISC]<br />
| <br />
| <br />
| <br />
| <br />
| Unproven technology, but potentially interesting.<br />
|-<br />
| Flash media<br />
| <br />
| <br />
| <br />
| <br />
| Wears out quickly, not-so-good long term storage. Soliciting donations for old flash media from people, or sponsorship from flash companies?<br />
|-<br />
| Glass/metal etching<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Amazon Glacier<br />
| $122.88 (storage only, retrieval billed separately)<br />
| <br />
| average annual durability of 99.999999999% <ref>"Amazon Glacier is designed to provide average annual durability of 99.999999999% for an archive. The service redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives. Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing." Maciej Ceglowski thinks that's [https://blog.pinboard.in/2014/04/cloudy_snake_oil/ kinda bullshit compared to the failure events you don't plan for], of course.</ref><br />
| <br />
| Retrieval is billed separately. Restoring 5% or less of the archive per month into S3 is free (5% of 100TB is 5TB), and data can be copied out from S3 to a SATA HD for $2.50/hr plus media handling and shipping fees. Downloading 5TB from S3 over the internet would cost $614.40 (~$122.88/TB), but only $44.82 to transfer to an HD via USB 3 or SATA (USB 2 is slower, so it racks up more billable hours).<br />
|-<br />
| Dropbox for Business<br />
| $160* ($795/year)<br />
| <br />
| <br />
| <br />
| Dropbox for Business provides a shared pool of 1TB per user at $795/year (five-user minimum, so 5TB), plus $125/year for each additional user.<br />
|-<br />
| Box.com for Business<br />
| $180* ("unlimited" storage for $900/year)<br />
| <br />
| <br />
| <br />
| Box.com for Business provides "unlimited" storage at $15/user/month with a five-user minimum, i.e. $900/year.<br />
|-<br />
| Dedicated colocated storage servers<br />
| $100* (e.g. $1300 for one year of 12TB rackmount server rental)<br />
|<br />
|<br />
|<br />
| Rent [http://www.ovh.com/us/dedicated-servers/storage/ storage servers from managed hosting colocation providers], and pool data across them. Benefits include bandwidth and electricity being included in the cost, and files can be made available online immediately. Negatives include needing to administer tens of servers.<br />
|-<br />
| Tahoe-LAFS<br />
|<br />
|<br />
|<br />
|<br />
| Tahoe-LAFS is software rather than a medium: a distributed, encrypted, erasure-coded storage grid layered over pool members' disks. It verifies and repairs data automatically, but the underlying drives still have to come from somewhere.<br />
|}<br />
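<br />
The dollar figures above are back-of-the-envelope. Below is a minimal Python sketch of the arithmetic behind several Cost column entries, using only the 2014 prices quoted in the rows (the 1024GB/TB convention and the $0.01/GB/month Glacier rate are inferred from the table's own $122.88 figure; treat this as a sanity check, not pricing advice):<br />
<pre>
# Reconstruct a few "Cost ($/TB/year)" entries from the prices quoted above.

def per_tb_year(total_dollars, tb, years=1):
    """Dollars per terabyte per year for a flat purchase or subscription."""
    return total_dollars / tb / years

print(per_tb_year(11_000, 180))   # Backblaze pod: ~$61/TB (the "$62, but 180TB" row)
print(per_tb_year(795, 5))        # Dropbox for Business: $159/TB/year (~"$160*")
print(per_tb_year(900, 5))        # Box.com for Business: $180/TB/year ("$180*")
print(per_tb_year(1_300, 12))     # 12TB colo server: ~$108/TB/year (~"$100*")

# Glacier storage was $0.01/GB/month in 2014; with 1TB = 1024GB this gives
# exactly the table's figure:
print(0.01 * 1024 * 12)           # $122.88/TB/year, storage only

# Blu-ray: $40 for a 50-pack of 25GB BD-Rs, written once, no parity or drive:
print(40 / (50 * 25))             # ~$0.032/GB -- the "under $0.04/GB" claim
</pre>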
<br />
== Non-options ==<br />
* Ink-based Consumer Optical Media (CDs, DVD, etc.) <br />
** Differences between Blu-Ray and DVD? DVDs do not last very long. The fact is, the history of writable optical media has been one of chicanery, failure, and overpromising while under-delivering. Some DVDs failed within a year. There are claims Blu-Ray is different, but fool me 3,504 times, shame on me.<br />
* BitTorrent Sync<br />
** Proprietary (currently), so not a good idea to use as an archival format/platform<br />
* Amazon S3 / Google Cloud Storage / Microsoft Azure Storage<br />
** Amazon S3 might be a viable waypoint for intra-month storage ($30.68/TB), but retrieval over the internet is expensive, as with Glacier: $8,499.08 for 100TB. Google's and Microsoft's offerings are all in the same price range.<br />
* Floppies<br />
** ''"Because 1.4 trillion floppies exists less than 700 billion floppies. HYPOTHETICALLY, if you set twenty stacks side by side, figure a quarter centimeter per floppy thickness, excluded the size of the drive needed to read the floppies you would still need a structure 175,000 ft. high to house them. Let's also assume that the failure rate for floppies is about 5% (everyone knows that varies by brand, usage, time of manufacture, materials used, etc, but lets say 5% per year). 70 million of those 1.4 trillion floppies are unusuable. Figuring 1.4 MB per floppy disk, you are losing approximately 100MB of porn each year. Assuming it takes 5 seconds to replace a bad floppy, you would have to spend 97,222 hrs/yr to replace them. Considering there are only 8,760 hrs per year, you would require a staff of 12 people replacing floppies around the clock or 24 people on 12 hr shifts. Figuring $7/hr you would spend $367,920 on labor alone. Figuring a nickel per bad floppy, you would need $3,500,000 annually in floppy disks, bringing your 1TB floppy raid operating costs (excluding electricity, etc) to $3,867, 920 and a whole landfill of corrupted porn. Thank you for destroying the planet and bankrupting a small country with your floppy based porn RAID."'' ([http://gizmodo.com/5431497/why-its-better-to-pretend-you-dont-know-anything-about-computers?comment=17793028#comments source])<br />
<br />
== From IRC ==<br />
<br />
<Drevkevac> we are looking to store 100TB+ of media offline for 25+ years<br />
<Drevkevac> if anyone wants to drop in, I will pastebin the chat log<br />
<rat> DVDR and BR-R are not high volume. When you have massive amounts of data, raid arrays have too many points of failure.<br />
<rat> Drevkevac: I work in a tv studio. We have 30+ years worth of tapes. And all of them are still good.<br />
<rat> find a hard drive from 30 years ago and see how well it hooks up ;)<br />
<brousch_> 1500 Taiyo Yuden Gold CD-Rs http://www.mediasupply.com/taiyo-yuden-gold-cd-rs.html<br />
<br />
<Drevkevac> still, if its true, you could do, perhaps, raidz3s in groups of 15 disks or so?<br />
<SketchCow> Please add paperbak to the wiki page.<br />
<SketchCow> Fuck Optical Media. not an option;.<br />
<Drevkevac> that would give you ~300GB per disk group, with 3 disks<br />
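<br />
The raidz3 arithmetic in the log checks out: raidz3 spends three discs per group on parity, so a 15-disc group of 25GB Blu-rays holds 12 &times; 25GB = 300GB and survives any three disc failures. A minimal sketch, using the group size, disc capacity, and 100TB target from the log:<br />
<pre>
import math

def raidz3_usable_gb(group_size, disc_gb, parity=3):
    """Usable capacity of one raidz3 group: all but the parity discs hold data."""
    return (group_size - parity) * disc_gb

per_group = raidz3_usable_gb(15, 25)     # 300 GB, matching the figure in the log
groups = math.ceil(100_000 / per_group)  # groups needed for the 100TB target
print(per_group, groups * 15)            # 300 GB/group, ~5010 discs in total
</pre>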
<br />
== Where are you going to put it? ==<br />
<br />
Okay, so you have the tech. Now you need a place for it to live.<br />
<br />
Possibilities:<br />
<br />
* The Internet Archive Physical Warehouse, Richmond, CA<br />
** The Internet Archive has several physical storage facilities, including warehouses in Richmond, CA (home of the Physical Archive) and the main location in San Francisco, CA. They have indicated they are willing to take copies of Archive Team-sponsored physical materials, with the intent that they be ingested into the Archive at large over time, as costs come down and 100tb collections are not as big a drain (or a rash of funding arrives elsewhere).<br />
<br />
* Living Computer Museum, Seattle, WA<br />
** In discussions with Jason Scott, the Living Computer Museum has indicated they will have physical storage available for computer historical materials. Depending on the items being saved by Archive Team, they may be willing to host/hold copies for the foreseeable future.<br />
<br />
* Library of Congress, Washington, DC<br />
** The Library of Congress may be willing to take a donation of physical storage, although they have not indicated what they would do with it long-term.<br />
<br />
Multiple copies would of course be great.<br />
<br />
== Project-specific suggestions ==<br />
<br />
=== Twitch.tv (and other video services) ===<br />
<br />
* Keep the original video files in (semi-)offline storage, and store transcoded (compressed) versions on the Internet Archive.<br />
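<br />
A minimal sketch of the transcode step, assuming ffmpeg is installed; the codec settings, CRF value, and paths are illustrative defaults, not a tested recommendation:<br />
<pre>
import subprocess
from pathlib import Path

def make_access_copy(original: Path, out_dir: Path) -> Path:
    """Transcode a master video to a smaller H.264/AAC access copy.

    The original file is left untouched -- it is what goes to (semi-)offline
    storage; only the access copy would be uploaded to the Internet Archive.
    """
    out = out_dir / (original.stem + ".mp4")
    subprocess.run(
        ["ffmpeg", "-i", str(original),
         "-c:v", "libx264", "-crf", "28", "-preset", "slow",
         "-c:a", "aac", "-b:a", "128k",
         str(out)],
        check=True,
    )
    return out
</pre>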
<br />
== See Also ==<br />
*[[Storage Media]]<br />
<br />
== References ==<br />
<references/><br />
<br />
{{Navigation box}}</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Valhalla&diff=20218Valhalla2014-09-21T17:27:36Z<p>Yipdw: </p>
<hr />
<div>[[Image:Ms internet on a disc.jpg|300px|right]]<br />
This wiki page is a collection of ideas for Project '''Valhalla'''.<br />
<br />
This project/discussion has come around because there is a class of data currently existing, several times a year, as a massive amount of data with "large, but nominal" status within the Internet Archive. The largest example is currently MobileMe, which is hundreds of terabytes in the Internet Archive system (and in need of WARC conversion), which represents a cost amount far outstripping its use. Another is TwitPic, which is currently available (and might continue to be available) but which has shown itself to be a bad actor with regards to longevity and predictability for its sunset. <br />
<br />
Therefore, there is an argument that there could be a "third place" that data collected by Archive Team could sit, until the Internet Archive (or another entity) grows its coffers/storage enough that 80-100tb is "no big deal", just like 1tb of data was annoying in 2009 and now is totally understandable for the value, i.e. Geocities. <br />
<br />
This is for short-term (or potentially also long-term) storage options, say five years or less, of data generated by Archive Team.<br />
<br />
* What options are out there, generally?<br />
* What are the costs, roughly?<br />
* What are the positives and negatives?<br />
<br />
There has been a lot of study in this area over the years, of course, so links to known authorities and debates will be welcome as well.<br />
<br />
Join the discussion in [irc://irc.efnet.org/huntinggrounds #huntinggrounds].<br />
<br />
== Goals ==<br />
<br />
We want to:<br />
<br />
* Dump an unlimited<ref>Unlimited doesn't mean infinite, but it does mean that we shouldn't worry about running out of space. We won't be the only expanding data store.</ref> amount of data into something.<br />
* Recover that data at any point.<br />
<br />
We do not care about:<br />
<br />
* Immediate or continuous availability.<br />
<br />
We absolutely require:<br />
<br />
* Low (ideally, zero) human time for maintenance.<br />
* Data integrity. The storage medium must be impossibly durable or make it inexpensive/easy to copy and verify the data onto a fresh medium.<br />
<br />
It would be nice to have:<br />
<br />
* No special environmental conditions that could not be handled by a third party. (So nobody in Archive Team would have to set up some sort of climate-controlled data-cave; however, if this is already something that e.g. IA does and they are willing to lease space, that's cool.)<br />
<br />
== What does the Internet Archive do for this Situation, Anyway? ==<br />
<br />
''This section has not been cleared by the Internet Archive, and so should be considered a rough sketch.''<br />
<br />
The Internet Archive primarily wants "access" to the data it stores, so the primary storage methodology is spinning hard drives connected to a high-speed connection from multiple locations. These hard drives are between 4-6tb (as of 2014) and are of general grade, as is most of the hardware - the theory is that replacing cheap hardware is better than spending a lot of money on super-grade hardware (whatever that may be) and not being able to make the dollars stretch. Hundreds of drives die in a month and the resiliency of the system allows them all to hot-swap in replacements. <br />
<br />
There are multiple warehouses for storing the original books that are scanned, as well as materials like CD-ROMs and even hard drives. There are collections of tapes and CD-ROMs from previous iterations of storage, although they are thought of as drop-dead options instead of long-term archival storage - the preference is, first and foremost, the spinning hard drives.<br />
<br />
The Archive does not generally use tape technology, having run into the classic "whoops, no tape drive on earth reads these any more" and "whoops, this tape no longer works properly".<br />
<br />
The Archive has indicated that if Archive Team uses a physical storage method, such as tapes, paper, hard drives or anything else, that they are willing to store these materials "as long as they are exceedingly labelled".<br />
<br />
== Options ==<br />
{| class="wikitable sortable"<br />
! Storage type<br />
! Cost ($/TB/year)<br />
! Storage density (m³/TB)<br />
! Theoretical lifespan<br />
! Practical, tested lifespan<br />
! Notes<br />
|-<br />
| Hard drives (simple distributed pool)<br />
| $150 (full cost of best reasonable 1TB+ external HD)<br />
| <br />
| <br />
| <br />
| September 2014, best reasonable 1TB+ external HD is [http://thewirecutter.com/reviews/the-best-external-desktop-hard-drive/ a 4TB WD]. 25+ pool members would need one HD each plus a computer plus software to distribute data across the entire pool.<br />
|-<br />
| Hard drives (dedicated distributed pool)<br />
| <br />
| <br />
| <br />
| <br />
| An off-the-shelf or otherwise specified, dedicated, network storage device used exclusively as part of a distributed pool.<br />
|-<br />
| Hard drives (SPOF) <ref>The [[Internet Archive]]'s cost per TB, with 24/7 online hard drives, is approximately $2000 for forever.</ref><br />
| $62 (but you have to buy 180TB)<br />
| <br />
| <br />
| <br />
| For a single location to provide all storage needs, building a [https://www.backblaze.com/blog/backblaze-storage-pod-4/ Backblaze Storage Pod 4.0] runs an average of $11,000, providing 180TB of [http://bioteam.net/2011/08/why-you-should-never-build-a-backblaze-pod/ non-redundant, not-highly-available] storage. (You really want more than one pod mirroring your data, but this is the most effective way to get that much storage in one place.)<br />
|-<br />
| Commercial / archival-grade tapes<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Consumer tape systems (VHS, Betamax, cassette tapes, ...)<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Vinyl<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| [http://www.ollydbg.de/Paperbak/index.html PaperBack]<br />
| <br />
| <br />
| <br />
| <br />
| 500KB per letter sheet means 1TB is 2,199,024 sheets, or ~4400 reams (500 sheets each), or an 8'x16' room filled with 6' tall stacks.<br />
|-<br />
| [http://ronja.twibright.com/optar/ Optar]<br />
| <br />
| <br />
| <br />
| <br />
| At 200KB per page, this has less than half the storage density of Paperback.<br />
|-<br />
| Blu-Ray<br />
| $40 (50 pack spindle of 25GB BD-Rs)<br />
| <br />
| 30 years<ref>On the basis of the described studies and assuming adequate consideration of the specified conditions for storage and handling, as well as verification of data after writing, we estimate the Imation CD, DVD or Blu-ray media to have a theoretical readability of up to 30 years. The primary caveat is how you handle and store the media. http://support.tdkperformance.com/app/answers/detail/a_id/1685/~/life-expectancy-of-optical-media </ref><br />
| <br />
| Lasts a LOT longer than CD/DVD, but should not be assumed to last more than a decade. [http://arstechnica.com/information-technology/2014/01/why-facebook-thinks-blu-ray-discs-are-perfect-for-the-data-center/ Raidz3 with Blu-rays Doing a backup in groups of 15 disks]. Comes to under $.04/GB which is cheap, and low initial investment (drives) too!<br><br />
<br>Specifically, a 50pack spindle of 25GB BD-Rs could readily hold 1TB of data for $30-50 per spindle. 50GB and 100GB discs are more expensive per GB.<br />
|-<br />
| [http://en.wikipedia.org/wiki/M-DISC M-DISC]<br />
| <br />
| <br />
| <br />
| <br />
| Unproven technology, but potentially interesting.<br />
|-<br />
| Flash media<br />
| <br />
| <br />
| <br />
| <br />
| Wears out quickly, not-so-good long term storage. Soliciting donations for old flash media from people, or sponsorship from flash companies?<br />
|-<br />
| Glass/metal etching<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Amazon Glacier<br />
| $122.88 (storage only, retrieval billed separately)<br />
| <br />
| average annual durability of 99.999999999% <ref>"Amazon Glacier is designed to provide average annual durability of 99.999999999% for an archive. The service redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives. Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing." Maciej Ceglowski thinks that's [https://blog.pinboard.in/2014/04/cloudy_snake_oil/ kinda bullshit compared to the failure events you don't plan for], of course.</ref><br />
| <br />
| Retrieval is billed separately. 5% or less per month into S3 is free (5% of 100TB is 5TB), and data can be copied out from S3 to a SATA HD for $2.50/hr. plus media handling and shipping fees. Downloading 5TB from S3 would cost $614.40 (~$122.88/TB), but only $44.82 to transfer to HD via USB 3 or SATA (USB 2 is slower).<br />
|-<br />
| Dropbox for Business<br />
| $160* ($795/year)<br />
| <br />
| <br />
| <br />
| Dropbox for Business provides a shared pool of 1TB per user, at $795/year (five user minimum, 5TB), and $125 each additional user/year.<br />
|-<br />
| Box.com for Business<br />
| $180* ("unlimited" storage for $900/year)<br />
| <br />
| <br />
| <br />
| Box.com for Business provides "unlimited" storage at $15/user/month, five user minimum, or $900/year.<br />
|-<br />
| Dedicated colocated storage servers<br />
| $100* (e.g. $1300 for one year of 12TB rackmount server rental)<br />
|<br />
|<br />
|<br />
| Rent [http://www.ovh.com/us/dedicated-servers/storage/ storage servers from managed hosting colocation providers], and pool data across them. Benefits include bandwidth and electricity being included in the cost, and files could be made available online immediately. Negatives include needing to administer tens of servers.<br />
|-<br />
| Tahoe-LAFS<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|}<br />
<br />
== Non-options ==<br />
* Ink-based Consumer Optical Media (CDs, DVD, etc.) <br />
** Differences between Blu-Ray and DVD? DVDs do not last very long. The fact is, the history of optical writable media has been on of chicanery, failure, and overpromising while under-delivering. Some DVDs failed within a year. There are claims Blu-Ray is different, but fool me 3,504 times, shame on me.<br />
* BitTorrent Sync<br />
** Proprietary (currently), so not a good idea to use as an archival format/platform<br />
* Amazon S3 / Google Cloud Storage / Microsoft Azure Storage<br />
** Amazon S3 might be a viable waypoint for intra-month storage ($30.68/TB), but retrieval over the internet, as with Glacier, is expensive, $8499.08 for 100TB. Google's and Microsoft's offerings are all in the same price range.<br />
* Floppies<br />
** ''"Because 1.4 trillion floppies exists less than 700 billion floppies. HYPOTHETICALLY, if you set twenty stacks side by side, figure a quarter centimeter per floppy thickness, excluded the size of the drive needed to read the floppies you would still need a structure 175,000 ft. high to house them. Let's also assume that the failure rate for floppies is about 5% (everyone knows that varies by brand, usage, time of manufacture, materials used, etc, but lets say 5% per year). 70 million of those 1.4 trillion floppies are unusuable. Figuring 1.4 MB per floppy disk, you are losing approximately 100MB of porn each year. Assuming it takes 5 seconds to replace a bad floppy, you would have to spend 97,222 hrs/yr to replace them. Considering there are only 8,760 hrs per year, you would require a staff of 12 people replacing floppies around the clock or 24 people on 12 hr shifts. Figuring $7/hr you would spend $367,920 on labor alone. Figuring a nickel per bad floppy, you would need $3,500,000 annually in floppy disks, bringing your 1TB floppy raid operating costs (excluding electricity, etc) to $3,867, 920 and a whole landfill of corrupted porn. Thank you for destroying the planet and bankrupting a small country with your floppy based porn RAID."'' ([http://gizmodo.com/5431497/why-its-better-to-pretend-you-dont-know-anything-about-computers?comment=17793028#comments source])<br />
<br />
== From IRC ==<br />
<br />
<Drevkevac> we are looking to store 100TB+ of media offline for 25+ years<br />
<Drevkevac> if anyone wants to drop in, I will pastebin the chat log<br />
<rat> DVDR and BR-R are not high volume. When you have massive amounts of data, raid arrays have too many points of failure.<br />
<rat> Drevkevac: I work in a tv studio. We have 30+ years worth of tapes. And all of them are still good.<br />
<rat> find a hard drive from 30 years ago and see how well it hooks up ;)<br />
<brousch_> 1500 Taiyo Yuden Gold CD-Rs http://www.mediasupply.com/taiyo-yuden-gold-cd-rs.html<br />
<br />
<Drevkevac> still, if its true, you could do, perhaps, raidz3s in groups of 15 disks or so?<br />
<SketchCow> Please add paperbak to the wiki page.<br />
<SketchCow> Fuck Optical Media. not an option;.<br />
<Drevkevac> that would give you ~300GB per disk group, with 3 disks<br />
<br />
== Where are you going to put it? ==<br />
<br />
Okay, so you have the tech. Now you need a place for it to live.<br />
<br />
Possibilities:<br />
<br />
* The Internet Archive Physical Warehouse, Richmond, CA<br />
** The Internet Archive has several physical storage facilities, including warehouses in Richmond, CA (home of the Physical Archive) and the main location in San Francisco, CA. They have indicated they are willing to take copies of Archive Team-sponsored physical materials with the intent of them being ingested into the Archive at large over time, as costs lower and 100tb collections are not as big a drain (or a rash of funding arrives elsewhere).<br />
<br />
* Living Computer Museum, Seattle, WA<br />
** In discussions with Jason Scott, the Living Computer Museum has indicated they will have physical storage available for computer historical materials. Depending on the items being saved by Archive Team, they may be willing to host/hold copies for the forseable future.<br />
<br />
* Library of Congress, Washington, DC<br />
** The Library of Congress may be willing to take a donation of physical storage, although it is not indicated what they may do long-term with it.<br />
<br />
Multiple copies would of course be great.<br />
<br />
== Project-specific suggestions ==<br />
<br />
=== Twitch.tv (and other video services) ===<br />
<br />
* Keep the original video files in (semi-)offline storage, and store transcoded (compressed) versions on the Internet Archive.<br />
<br />
== See Also ==<br />
*[[Storage Media]]<br />
<br />
== References ==<br />
<references/><br />
<br />
{{Navigation box}}</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Valhalla&diff=20217Valhalla2014-09-21T17:25:50Z<p>Yipdw: </p>
<hr />
<div>[[Image:Ms internet on a disc.jpg|300px|right]]<br />
This wiki page is a collection of ideas for Project '''Valhalla'''.<br />
<br />
This project/discussion has come around because there is a class of data currently existing, several times a year, as a massive amount of data with "large, but nominal" status within the Internet Archive. The largest example is currently MobileMe, which is hundreds of terabytes in the Internet Archive system (and in need of WARC conversion), which represents a cost amount far outstripping its use. Another is TwitPic, which is currently available (and might continue to be available) but which has shown itself to be a bad actor with regards to longevity and predictability for its sunset. <br />
<br />
Therefore, there is an argument that there could be a "third place" that data collected by Archive Team could sit, until the Internet Archive (or another entity) grows its coffers/storage enough that 80-100tb is "no big deal", just like 1tb of data was annoying in 2009 and now is totally understandable for the value, i.e. Geocities. <br />
<br />
This is for short-term (or potentially also long-term) storage options, say five years or less, of data generated by Archive Team.<br />
<br />
* What options are out there, generally?<br />
* What are the costs, roughly?<br />
* What are the positives and negatives?<br />
<br />
There has been a lot of study in this area over the years, of course, so links to known authorities and debates will be welcome as well.<br />
<br />
Join the discussion in [irc://irc.efnet.org/huntinggrounds #huntinggrounds].<br />
<br />
== Goals ==<br />
<br />
We want to:<br />
<br />
* Dump an unlimited<ref>Yes, unlimited means infinite. This is one thing that makes this hard. Take "impossible" to slashdot.</ref> amount of data into something.<br />
* Recover that data at any point.<br />
<br />
We do not care about:<br />
<br />
* Immediate or continuous availability.<br />
<br />
We absolutely require:<br />
<br />
* Low (ideally, zero) human time for maintenance.<br />
* Data integrity. The storage medium must be impossibly durable or make it inexpensive/easy to copy and verify the data onto a fresh medium.<br />
<br />
It would be nice to have:<br />
<br />
* No special environmental conditions that could not be handled by a third party. (So nobody in Archive Team would have to set up some sort of climate-controlled data-cave; however, if this is already something that e.g. IA does and they are willing to lease space, that's cool.)<br />
<br />
== What does the Internet Archive do for this Situation, Anyway? ==<br />
<br />
''This section has not been cleared by the Internet Archive, and so should be considered a rough sketch.''<br />
<br />
The Internet Archive primarily wants "access" to the data it stores, so the primary storage methodology is spinning hard drives connected to a high-speed connection from multiple locations. These hard drives are between 4-6tb (as of 2014) and are of general grade, as is most of the hardware - the theory is that replacing cheap hardware is better than spending a lot of money on super-grade hardware (whatever that may be) and not being able to make the dollars stretch. Hundreds of drives die in a month and the resiliency of the system allows them all to hot-swap in replacements. <br />
<br />
There are multiple warehouses for storing the original books that are scanned, as well as materials like CD-ROMs and even hard drives. There are collections of tapes and CD-ROMs from previous iterations of storage, although they are thought of as drop-dead options instead of long-term archival storage - the preference is, first and foremost, the spinning hard drives.<br />
<br />
The Archive does not generally use tape technology, having run into the classic "whoops, no tape drive on earth reads these any more" and "whoops, this tape no longer works properly".<br />
<br />
The Archive has indicated that if Archive Team uses a physical storage method, such as tapes, paper, hard drives or anything else, that they are willing to store these materials "as long as they are exceedingly labelled".<br />
<br />
== Options ==<br />
{| class="wikitable sortable"<br />
! Storage type<br />
! Cost ($/TB/year)<br />
! Storage density (m³/TB)<br />
! Theoretical lifespan<br />
! Practical, tested lifespan<br />
! Notes<br />
|-<br />
| Hard drives (simple distributed pool)<br />
| $150 (full cost of best reasonable 1TB+ external HD)<br />
| <br />
| <br />
| <br />
| September 2014, best reasonable 1TB+ external HD is [http://thewirecutter.com/reviews/the-best-external-desktop-hard-drive/ a 4TB WD]. 25+ pool members would need one HD each plus a computer plus software to distribute data across the entire pool.<br />
|-<br />
| Hard drives (dedicated distributed pool)<br />
| <br />
| <br />
| <br />
| <br />
| An off-the-shelf or otherwise specified, dedicated, network storage device used exclusively as part of a distributed pool.<br />
|-<br />
| Hard drives (SPOF) <ref>The [[Internet Archive]]'s cost per TB, with 24/7 online hard drives, is approximately $2000 for forever.</ref><br />
| $62 (but you have to buy 180TB)<br />
| <br />
| <br />
| <br />
| For a single location to provide all storage needs, building a [https://www.backblaze.com/blog/backblaze-storage-pod-4/ Backblaze Storage Pod 4.0] runs an average of $11,000, providing 180TB of [http://bioteam.net/2011/08/why-you-should-never-build-a-backblaze-pod/ non-redundant, not-highly-available] storage. (You really want more than one pod mirroring your data, but this is the most effective way to get that much storage in one place.)<br />
|-<br />
| Commercial / archival-grade tapes<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Consumer tape systems (VHS, Betamax, cassette tapes, ...)<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Vinyl<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| [http://www.ollydbg.de/Paperbak/index.html PaperBack]<br />
| <br />
| <br />
| <br />
| <br />
| 500KB per letter sheet means 1TB is 2,199,024 sheets, or ~4400 reams (500 sheets each), or an 8'x16' room filled with 6' tall stacks.<br />
|-<br />
| [http://ronja.twibright.com/optar/ Optar]<br />
| <br />
| <br />
| <br />
| <br />
| At 200KB per page, this has less than half the storage density of Paperback.<br />
|-<br />
| Blu-Ray<br />
| $40 (50 pack spindle of 25GB BD-Rs)<br />
| <br />
| 30 years<ref>On the basis of the described studies and assuming adequate consideration of the specified conditions for storage and handling, as well as verification of data after writing, we estimate the Imation CD, DVD or Blu-ray media to have a theoretical readability of up to 30 years. The primary caveat is how you handle and store the media. http://support.tdkperformance.com/app/answers/detail/a_id/1685/~/life-expectancy-of-optical-media </ref><br />
| <br />
| Lasts a LOT longer than CD/DVD, but should not be assumed to last more than a decade. [http://arstechnica.com/information-technology/2014/01/why-facebook-thinks-blu-ray-discs-are-perfect-for-the-data-center/ Raidz3 with Blu-rays Doing a backup in groups of 15 disks]. Comes to under $.04/GB which is cheap, and low initial investment (drives) too!<br><br />
<br>Specifically, a 50pack spindle of 25GB BD-Rs could readily hold 1TB of data for $30-50 per spindle. 50GB and 100GB discs are more expensive per GB.<br />
|-<br />
| [http://en.wikipedia.org/wiki/M-DISC M-DISC]<br />
| <br />
| <br />
| <br />
| <br />
| Unproven technology, but potentially interesting.<br />
|-<br />
| Flash media<br />
| <br />
| <br />
| <br />
| <br />
| Wears out quickly, not-so-good long term storage. Soliciting donations for old flash media from people, or sponsorship from flash companies?<br />
|-<br />
| Glass/metal etching<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Amazon Glacier<br />
| $122.88 (storage only, retrieval billed separately)<br />
| <br />
| average annual durability of 99.999999999% <ref>"Amazon Glacier is designed to provide average annual durability of 99.999999999% for an archive. The service redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives. Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing." Maciej Ceglowski thinks that's [https://blog.pinboard.in/2014/04/cloudy_snake_oil/ kinda bullshit compared to the failure events you don't plan for], of course.</ref><br />
| <br />
| Retrieval is billed separately. 5% or less per month into S3 is free (5% of 100TB is 5TB), and data can be copied out from S3 to a SATA HD for $2.50/hr. plus media handling and shipping fees. Downloading 5TB from S3 would cost $614.40 (~$122.88/TB), but only $44.82 to transfer to HD via USB 3 or SATA (USB 2 is slower).<br />
|-<br />
| Dropbox for Business<br />
| $160* ($795/year)<br />
| <br />
| <br />
| <br />
| Dropbox for Business provides a shared pool of 1TB per user, at $795/year (five user minimum, 5TB), and $125 each additional user/year.<br />
|-<br />
| Box.com for Business<br />
| $180* ("unlimited" storage for $900/year)<br />
| <br />
| <br />
| <br />
| Box.com for Business provides "unlimited" storage at $15/user/month, five user minimum, or $900/year.<br />
|-<br />
| Dedicated colocated storage servers<br />
| $100* (e.g. $1300 for one year of 12TB rackmount server rental)<br />
|<br />
|<br />
|<br />
| Rent [http://www.ovh.com/us/dedicated-servers/storage/ storage servers from managed hosting colocation providers], and pool data across them. Benefits include bandwidth and electricity being included in the cost, and files could be made available online immediately. Negatives include needing to administer tens of servers.<br />
|-<br />
| Tahoe-LAFS<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|}<br />
<br />
== Non-options ==<br />
* Ink-based Consumer Optical Media (CDs, DVD, etc.) <br />
** Differences between Blu-Ray and DVD? DVDs do not last very long. The fact is, the history of optical writable media has been on of chicanery, failure, and overpromising while under-delivering. Some DVDs failed within a year. There are claims Blu-Ray is different, but fool me 3,504 times, shame on me.<br />
* BitTorrent Sync<br />
** Proprietary (currently), so not a good idea to use as an archival format/platform<br />
* Amazon S3 / Google Cloud Storage / Microsoft Azure Storage<br />
** Amazon S3 might be a viable waypoint for intra-month storage ($30.68/TB), but retrieval over the internet, as with Glacier, is expensive, $8499.08 for 100TB. Google's and Microsoft's offerings are all in the same price range.<br />
* Floppies<br />
** ''"Because 1.4 trillion floppies exists less than 700 billion floppies. HYPOTHETICALLY, if you set twenty stacks side by side, figure a quarter centimeter per floppy thickness, excluded the size of the drive needed to read the floppies you would still need a structure 175,000 ft. high to house them. Let's also assume that the failure rate for floppies is about 5% (everyone knows that varies by brand, usage, time of manufacture, materials used, etc, but lets say 5% per year). 70 million of those 1.4 trillion floppies are unusuable. Figuring 1.4 MB per floppy disk, you are losing approximately 100MB of porn each year. Assuming it takes 5 seconds to replace a bad floppy, you would have to spend 97,222 hrs/yr to replace them. Considering there are only 8,760 hrs per year, you would require a staff of 12 people replacing floppies around the clock or 24 people on 12 hr shifts. Figuring $7/hr you would spend $367,920 on labor alone. Figuring a nickel per bad floppy, you would need $3,500,000 annually in floppy disks, bringing your 1TB floppy raid operating costs (excluding electricity, etc) to $3,867, 920 and a whole landfill of corrupted porn. Thank you for destroying the planet and bankrupting a small country with your floppy based porn RAID."'' ([http://gizmodo.com/5431497/why-its-better-to-pretend-you-dont-know-anything-about-computers?comment=17793028#comments source])<br />
<br />
== From IRC ==<br />
<br />
<Drevkevac> we are looking to store 100TB+ of media offline for 25+ years<br />
<Drevkevac> if anyone wants to drop in, I will pastebin the chat log<br />
<rat> DVDR and BR-R are not high volume. When you have massive amounts of data, raid arrays have too many points of failure.<br />
<rat> Drevkevac: I work in a tv studio. We have 30+ years worth of tapes. And all of them are still good.<br />
<rat> find a hard drive from 30 years ago and see how well it hooks up ;)<br />
<brousch_> 1500 Taiyo Yuden Gold CD-Rs http://www.mediasupply.com/taiyo-yuden-gold-cd-rs.html<br />
<br />
<Drevkevac> still, if its true, you could do, perhaps, raidz3s in groups of 15 disks or so?<br />
<SketchCow> Please add paperbak to the wiki page.<br />
<SketchCow> Fuck Optical Media. not an option;.<br />
<Drevkevac> that would give you ~300GB per disk group, with 3 disks<br />
<br />
== Where are you going to put it? ==<br />
<br />
Okay, so you have the tech. Now you need a place for it to live.<br />
<br />
Possibilities:<br />
<br />
* The Internet Archive Physical Warehouse, Richmond, CA<br />
** The Internet Archive has several physical storage facilities, including warehouses in Richmond, CA (home of the Physical Archive) and the main location in San Francisco, CA. They have indicated they are willing to take copies of Archive Team-sponsored physical materials with the intent of them being ingested into the Archive at large over time, as costs lower and 100tb collections are not as big a drain (or a rash of funding arrives elsewhere).<br />
<br />
* Living Computer Museum, Seattle, WA<br />
** In discussions with Jason Scott, the Living Computer Museum has indicated they will have physical storage available for computer historical materials. Depending on the items being saved by Archive Team, they may be willing to host/hold copies for the forseable future.<br />
<br />
* Library of Congress, Washington, DC<br />
** The Library of Congress may be willing to take a donation of physical storage, although it is not indicated what they may do long-term with it.<br />
<br />
Multiple copies would of course be great.<br />
<br />
== Project-specific suggestions ==<br />
<br />
=== Twitch.tv (and other video services) ===<br />
<br />
* Keep the original video files in (semi-)offline storage, and store transcoded (compressed) versions on the Internet Archive.<br />
<br />
== See Also ==<br />
*[[Storage Media]]<br />
<br />
== References ==<br />
<references/><br />
<br />
{{Navigation box}}</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Valhalla&diff=20216Valhalla2014-09-21T17:24:45Z<p>Yipdw: /* Goals */</p>
<hr />
<div>[[Image:Ms internet on a disc.jpg|300px|right]]<br />
This wiki page is a collection of ideas for Project '''Valhalla'''.<br />
<br />
This project/discussion has come around because there is a class of data currently existing, several times a year, as a massive amount of data with "large, but nominal" status within the Internet Archive. The largest example is currently MobileMe, which is hundreds of terabytes in the Internet Archive system (and in need of WARC conversion), which represents a cost amount far outstripping its use. Another is TwitPic, which is currently available (and might continue to be available) but which has shown itself to be a bad actor with regards to longevity and predictability for its sunset. <br />
<br />
Therefore, there is an argument that there could be a "third place" that data collected by Archive Team could sit, until the Internet Archive (or another entity) grows its coffers/storage enough that 80-100tb is "no big deal", just like 1tb of data was annoying in 2009 and now is totally understandable for the value, i.e. Geocities. <br />
<br />
This is for short-term (or potentially also long-term) storage options, say five years or less, of data generated by Archive Team.<br />
<br />
* What options are out there, generally?<br />
* What are the costs, roughly?<br />
* What are the positives and negatives?<br />
<br />
There has been a lot of study in this area over the years, of course, so links to known authorities and debates will be welcome as well.<br />
<br />
Join the discussion in [irc://irc.efnet.org/huntinggrounds #huntinggrounds].<br />
<br />
== Goals ==<br />
<br />
We want to:<br />
<br />
* Dump an unlimited<ref>Take pedantry about "unlimited" to slashdot</ref> amount of data into something.<br />
* Recover that data at any point.<br />
<br />
We do not care about:<br />
<br />
* Immediate or continuous availability.<br />
<br />
We absolutely require:<br />
<br />
* Low (ideally, zero) human time for maintenance.<br />
* Data integrity. The storage medium must be impossibly durable or make it inexpensive/easy to copy and verify the data onto a fresh medium.<br />
<br />
It would be nice to have:<br />
<br />
* No special environmental conditions that could not be handled by a third party. (So nobody in Archive Team would have to set up some sort of climate-controlled data-cave; however, if this is already something that e.g. IA does and they are willing to lease space, that's cool.)<br />
<br />
== What does the Internet Archive do for this Situation, Anyway? ==<br />
<br />
''This section has not been cleared by the Internet Archive, and so should be considered a rough sketch.''<br />
<br />
The Internet Archive primarily wants "access" to the data it stores, so the primary storage methodology is spinning hard drives connected to a high-speed connection from multiple locations. These hard drives are between 4-6tb (as of 2014) and are of general grade, as is most of the hardware - the theory is that replacing cheap hardware is better than spending a lot of money on super-grade hardware (whatever that may be) and not being able to make the dollars stretch. Hundreds of drives die in a month and the resiliency of the system allows them all to hot-swap in replacements. <br />
<br />
There are multiple warehouses for storing the original books that are scanned, as well as materials like CD-ROMs and even hard drives. There are collections of tapes and CD-ROMs from previous iterations of storage, although they are thought of as drop-dead options instead of long-term archival storage - the preference is, first and foremost, the spinning hard drives.<br />
<br />
The Archive does not generally use tape technology, having run into the classic "whoops, no tape drive on earth reads these any more" and "whoops, this tape no longer works properly".<br />
<br />
The Archive has indicated that if Archive Team uses a physical storage method, such as tapes, paper, hard drives or anything else, that they are willing to store these materials "as long as they are exceedingly labelled".<br />
<br />
== Options ==<br />
{| class="wikitable sortable"<br />
! Storage type<br />
! Cost ($/TB/year)<br />
! Storage density (m³/TB)<br />
! Theoretical lifespan<br />
! Practical, tested lifespan<br />
! Notes<br />
|-<br />
| Hard drives (simple distributed pool)<br />
| $150 (full cost of best reasonable 1TB+ external HD)<br />
| <br />
| <br />
| <br />
| September 2014, best reasonable 1TB+ external HD is [http://thewirecutter.com/reviews/the-best-external-desktop-hard-drive/ a 4TB WD]. 25+ pool members would need one HD each plus a computer plus software to distribute data across the entire pool.<br />
|-<br />
| Hard drives (dedicated distributed pool)<br />
| <br />
| <br />
| <br />
| <br />
| An off-the-shelf or otherwise specified, dedicated, network storage device used exclusively as part of a distributed pool.<br />
|-<br />
| Hard drives (SPOF) <ref>The [[Internet Archive]]'s cost per TB, with 24/7 online hard drives, is approximately $2000 for forever.</ref><br />
| $62 (but you have to buy 180TB)<br />
| <br />
| <br />
| <br />
| For a single location to provide all storage needs, building a [https://www.backblaze.com/blog/backblaze-storage-pod-4/ Backblaze Storage Pod 4.0] runs an average of $11,000, providing 180TB of [http://bioteam.net/2011/08/why-you-should-never-build-a-backblaze-pod/ non-redundant, not-highly-available] storage. (You really want more than one pod mirroring your data, but this is the most effective way to get that much storage in one place.)<br />
|-<br />
| Commercial / archival-grade tapes<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Consumer tape systems (VHS, Betamax, cassette tapes, ...)<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Vinyl<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| [http://www.ollydbg.de/Paperbak/index.html PaperBack]<br />
| <br />
| <br />
| <br />
| <br />
| 500KB per letter sheet means 1TB is 2,199,024 sheets, or ~4400 reams (500 sheets each), or an 8'x16' room filled with 6' tall stacks.<br />
|-<br />
| [http://ronja.twibright.com/optar/ Optar]<br />
| <br />
| <br />
| <br />
| <br />
| At 200KB per page, this has less than half the storage density of Paperback.<br />
|-<br />
| Blu-Ray<br />
| $40 (50 pack spindle of 25GB BD-Rs)<br />
| <br />
| 30 years<ref>On the basis of the described studies and assuming adequate consideration of the specified conditions for storage and handling, as well as verification of data after writing, we estimate the Imation CD, DVD or Blu-ray media to have a theoretical readability of up to 30 years. The primary caveat is how you handle and store the media. http://support.tdkperformance.com/app/answers/detail/a_id/1685/~/life-expectancy-of-optical-media </ref><br />
| <br />
| Lasts a LOT longer than CD/DVD, but should not be assumed to last more than a decade. [http://arstechnica.com/information-technology/2014/01/why-facebook-thinks-blu-ray-discs-are-perfect-for-the-data-center/ Raidz3 with Blu-rays Doing a backup in groups of 15 disks]. Comes to under $.04/GB which is cheap, and low initial investment (drives) too!<br><br />
<br>Specifically, a 50pack spindle of 25GB BD-Rs could readily hold 1TB of data for $30-50 per spindle. 50GB and 100GB discs are more expensive per GB.<br />
|-<br />
| [http://en.wikipedia.org/wiki/M-DISC M-DISC]<br />
| <br />
| <br />
| <br />
| <br />
| Unproven technology, but potentially interesting.<br />
|-<br />
| Flash media<br />
| <br />
| <br />
| <br />
| <br />
| Wears out quickly, not-so-good long term storage. Soliciting donations for old flash media from people, or sponsorship from flash companies?<br />
|-<br />
| Glass/metal etching<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Amazon Glacier<br />
| $122.88 (storage only, retrieval billed separately)<br />
| <br />
| average annual durability of 99.999999999% <ref>"Amazon Glacier is designed to provide average annual durability of 99.999999999% for an archive. The service redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives. Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing." Maciej Ceglowski thinks that's [https://blog.pinboard.in/2014/04/cloudy_snake_oil/ kinda bullshit compared to the failure events you don't plan for], of course.</ref><br />
| <br />
| Retrieval is billed separately. 5% or less per month into S3 is free (5% of 100TB is 5TB), and data can be copied out from S3 to a SATA HD for $2.50/hr. plus media handling and shipping fees. Downloading 5TB from S3 would cost $614.40 (~$122.88/TB), but only $44.82 to transfer to HD via USB 3 or SATA (USB 2 is slower).<br />
|-<br />
| Dropbox for Business<br />
| $160* ($795/year)<br />
| <br />
| <br />
| <br />
| Dropbox for Business provides a shared pool of 1TB per user, at $795/year (five user minimum, 5TB), and $125 each additional user/year.<br />
|-<br />
| Box.com for Business<br />
| $180* ("unlimited" storage for $900/year)<br />
| <br />
| <br />
| <br />
| Box.com for Business provides "unlimited" storage at $15/user/month, five user minimum, or $900/year.<br />
|-<br />
| Dedicated colocated storage servers<br />
| $100* (e.g. $1300 for one year of 12TB rackmount server rental)<br />
|<br />
|<br />
|<br />
| Rent [http://www.ovh.com/us/dedicated-servers/storage/ storage servers from managed hosting colocation providers], and pool data across them. Benefits include bandwidth and electricity being included in the cost, and files could be made available online immediately. Negatives include needing to administer tens of servers.<br />
|-<br />
| Tahoe-LAFS<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|}<br />
<br />
== Non-options ==<br />
* Ink-based Consumer Optical Media (CDs, DVD, etc.) <br />
** Differences between Blu-Ray and DVD? DVDs do not last very long. The fact is, the history of optical writable media has been on of chicanery, failure, and overpromising while under-delivering. Some DVDs failed within a year. There are claims Blu-Ray is different, but fool me 3,504 times, shame on me.<br />
* BitTorrent Sync<br />
** Proprietary (currently), so not a good idea to use as an archival format/platform<br />
* Amazon S3 / Google Cloud Storage / Microsoft Azure Storage<br />
** Amazon S3 might be a viable waypoint for intra-month storage ($30.68/TB), but retrieval over the internet, as with Glacier, is expensive, $8499.08 for 100TB. Google's and Microsoft's offerings are all in the same price range.<br />
* Floppies<br />
** ''"Because 1.4 trillion floppies exists less than 700 billion floppies. HYPOTHETICALLY, if you set twenty stacks side by side, figure a quarter centimeter per floppy thickness, excluded the size of the drive needed to read the floppies you would still need a structure 175,000 ft. high to house them. Let's also assume that the failure rate for floppies is about 5% (everyone knows that varies by brand, usage, time of manufacture, materials used, etc, but lets say 5% per year). 70 million of those 1.4 trillion floppies are unusuable. Figuring 1.4 MB per floppy disk, you are losing approximately 100MB of porn each year. Assuming it takes 5 seconds to replace a bad floppy, you would have to spend 97,222 hrs/yr to replace them. Considering there are only 8,760 hrs per year, you would require a staff of 12 people replacing floppies around the clock or 24 people on 12 hr shifts. Figuring $7/hr you would spend $367,920 on labor alone. Figuring a nickel per bad floppy, you would need $3,500,000 annually in floppy disks, bringing your 1TB floppy raid operating costs (excluding electricity, etc) to $3,867, 920 and a whole landfill of corrupted porn. Thank you for destroying the planet and bankrupting a small country with your floppy based porn RAID."'' ([http://gizmodo.com/5431497/why-its-better-to-pretend-you-dont-know-anything-about-computers?comment=17793028#comments source])<br />
<br />
== From IRC ==<br />
<br />
<Drevkevac> we are looking to store 100TB+ of media offline for 25+ years<br />
<Drevkevac> if anyone wants to drop in, I will pastebin the chat log<br />
<rat> DVDR and BR-R are not high volume. When you have massive amounts of data, raid arrays have too many points of failure.<br />
<rat> Drevkevac: I work in a tv studio. We have 30+ years worth of tapes. And all of them are still good.<br />
<rat> find a hard drive from 30 years ago and see how well it hooks up ;)<br />
<brousch_> 1500 Taiyo Yuden Gold CD-Rs http://www.mediasupply.com/taiyo-yuden-gold-cd-rs.html<br />
<br />
<Drevkevac> still, if its true, you could do, perhaps, raidz3s in groups of 15 disks or so?<br />
<SketchCow> Please add paperbak to the wiki page.<br />
<SketchCow> Fuck Optical Media. not an option;.<br />
<Drevkevac> that would give you ~300GB per disk group, with 3 disks<br />
<br />
== Where are you going to put it? ==<br />
<br />
Okay, so you have the tech. Now you need a place for it to live.<br />
<br />
Possibilities:<br />
<br />
* The Internet Archive Physical Warehouse, Richmond, CA<br />
** The Internet Archive has several physical storage facilities, including warehouses in Richmond, CA (home of the Physical Archive) and the main location in San Francisco, CA. They have indicated they are willing to take copies of Archive Team-sponsored physical materials with the intent of them being ingested into the Archive at large over time, as costs lower and 100tb collections are not as big a drain (or a rash of funding arrives elsewhere).<br />
<br />
* Living Computer Museum, Seattle, WA<br />
** In discussions with Jason Scott, the Living Computer Museum has indicated they will have physical storage available for computer historical materials. Depending on the items being saved by Archive Team, they may be willing to host/hold copies for the forseable future.<br />
<br />
* Library of Congress, Washington, DC<br />
** The Library of Congress may be willing to take a donation of physical storage, although it is not indicated what they may do long-term with it.<br />
<br />
Multiple copies would of course be great.<br />
<br />
== Project-specific suggestions ==<br />
<br />
=== Twitch.tv (and other video services) ===<br />
<br />
* Keep the original video files in (semi-)offline storage, and store transcoded (compressed) versions on the Internet Archive.<br />
<br />
== See Also ==<br />
*[[Storage Media]]<br />
<br />
== References ==<br />
<references/><br />
<br />
{{Navigation box}}</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Valhalla&diff=20215Valhalla2014-09-21T17:21:59Z<p>Yipdw: </p>
<hr />
<div>[[Image:Ms internet on a disc.jpg|300px|right]]<br />
This wiki page is a collection of ideas for Project '''Valhalla'''.<br />
<br />
This project/discussion has come around because there is a class of data currently existing, several times a year, as a massive amount of data with "large, but nominal" status within the Internet Archive. The largest example is currently MobileMe, which is hundreds of terabytes in the Internet Archive system (and in need of WARC conversion), which represents a cost amount far outstripping its use. Another is TwitPic, which is currently available (and might continue to be available) but which has shown itself to be a bad actor with regards to longevity and predictability for its sunset. <br />
<br />
Therefore, there is an argument that there could be a "third place" that data collected by Archive Team could sit, until the Internet Archive (or another entity) grows its coffers/storage enough that 80-100tb is "no big deal", just like 1tb of data was annoying in 2009 and now is totally understandable for the value, i.e. Geocities. <br />
<br />
This is for short-term (or potentially also long-term) storage options, say five years or less, of data generated by Archive Team.<br />
<br />
* What options are out there, generally?<br />
* What are the costs, roughly?<br />
* What are the positives and negatives?<br />
<br />
There has been a lot of study in this area over the years, of course, so links to known authorities and debates will be welcome as well.<br />
<br />
Join the discussion in [irc://irc.efnet.org/huntinggrounds #huntinggrounds].<br />
<br />
== Goals ==<br />
<br />
We want to:<br />
<br />
* Dump an unlimited amount of data into something.<br />
* Recover that data at any point.<br />
<br />
We do not care about:<br />
<br />
* Immediate or continuous availability.<br />
<br />
We absolutely require:<br />
<br />
* Low (ideally, zero) human time for maintenance.<br />
* Data integrity. The storage medium must be impossibly durable or make it inexpensive/easy to copy and verify the data onto a fresh medium.<br />
<br />
It would be nice to have:<br />
<br />
* No special environmental conditions that could not be handled by a third party. (So nobody in Archive Team would have to set up some sort of climate-controlled data-cave; however, if this is already something that e.g. IA does and they are willing to lease space, that's cool.)<br />
<br />
== What does the Internet Archive do for this Situation, Anyway? ==<br />
<br />
''This section has not been cleared by the Internet Archive, and so should be considered a rough sketch.''<br />
<br />
The Internet Archive primarily wants "access" to the data it stores, so the primary storage methodology is spinning hard drives connected to a high-speed connection from multiple locations. These hard drives are between 4-6tb (as of 2014) and are of general grade, as is most of the hardware - the theory is that replacing cheap hardware is better than spending a lot of money on super-grade hardware (whatever that may be) and not being able to make the dollars stretch. Hundreds of drives die in a month and the resiliency of the system allows them all to hot-swap in replacements. <br />
<br />
There are multiple warehouses for storing the original books that are scanned, as well as materials like CD-ROMs and even hard drives. There are collections of tapes and CD-ROMs from previous iterations of storage, although they are thought of as drop-dead options instead of long-term archival storage - the preference is, first and foremost, the spinning hard drives.<br />
<br />
The Archive does not generally use tape technology, having run into the classic "whoops, no tape drive on earth reads these any more" and "whoops, this tape no longer works properly".<br />
<br />
The Archive has indicated that if Archive Team uses a physical storage method, such as tapes, paper, hard drives or anything else, that they are willing to store these materials "as long as they are exceedingly labelled".<br />
<br />
== Options ==<br />
{| class="wikitable sortable"<br />
! Storage type<br />
! Cost ($/TB/year)<br />
! Storage density (m³/TB)<br />
! Theoretical lifespan<br />
! Practical, tested lifespan<br />
! Notes<br />
|-<br />
| Hard drives (simple distributed pool)<br />
| $150 (full cost of best reasonable 1TB+ external HD)<br />
| <br />
| <br />
| <br />
| September 2014, best reasonable 1TB+ external HD is [http://thewirecutter.com/reviews/the-best-external-desktop-hard-drive/ a 4TB WD]. 25+ pool members would need one HD each plus a computer plus software to distribute data across the entire pool.<br />
|-<br />
| Hard drives (dedicated distributed pool)<br />
| <br />
| <br />
| <br />
| <br />
| An off-the-shelf or otherwise specified, dedicated, network storage device used exclusively as part of a distributed pool.<br />
|-<br />
| Hard drives (SPOF) <ref>The [[Internet Archive]]'s cost per TB, with 24/7 online hard drives, is approximately $2000 for forever.</ref><br />
| $62 (but you have to buy 180TB)<br />
| <br />
| <br />
| <br />
| For a single location to provide all storage needs, building a [https://www.backblaze.com/blog/backblaze-storage-pod-4/ Backblaze Storage Pod 4.0] runs an average of $11,000, providing 180TB of [http://bioteam.net/2011/08/why-you-should-never-build-a-backblaze-pod/ non-redundant, not-highly-available] storage. (You really want more than one pod mirroring your data, but this is the most effective way to get that much storage in one place.)<br />
|-<br />
| Commercial / archival-grade tapes<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Consumer tape systems (VHS, Betamax, cassette tapes, ...)<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Vinyl<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| [http://www.ollydbg.de/Paperbak/index.html PaperBack]<br />
| <br />
| <br />
| <br />
| <br />
| 500KB per letter-size sheet means 1TB is 2,199,024 sheets (counting a TB as 2^40 bytes), or ~4,400 reams (500 sheets each) - an 8'x16' room filled with 6'-tall stacks. Worked out in the sketch after this table.<br />
|-<br />
| [http://ronja.twibright.com/optar/ Optar]<br />
| <br />
| <br />
| <br />
| <br />
| At 200KB per page, this has less than half the storage density of PaperBack.<br />
|-<br />
| Blu-Ray<br />
| $40 (50 pack spindle of 25GB BD-Rs)<br />
| <br />
| 30 years<ref>On the basis of the described studies and assuming adequate consideration of the specified conditions for storage and handling, as well as verification of data after writing, we estimate the Imation CD, DVD or Blu-ray media to have a theoretical readability of up to 30 years. The primary caveat is how you handle and store the media. http://support.tdkperformance.com/app/answers/detail/a_id/1685/~/life-expectancy-of-optical-media </ref><br />
| <br />
| Lasts a LOT longer than CD/DVD, but should not be assumed to last more than a decade. [http://arstechnica.com/information-technology/2014/01/why-facebook-thinks-blu-ray-discs-are-perfect-for-the-data-center/ Facebook uses Blu-ray discs for data-center cold storage]; raidz3-style backups in groups of 15 discs have been suggested. Comes to under $0.04/GB, which is cheap, with a low initial investment (drives) too!<br><br />
<br>Specifically, a 50-pack spindle of 25GB BD-Rs can hold over 1TB of data and costs $30-50 per spindle (see the sketch after this table). 50GB and 100GB discs are more expensive per GB.<br />
|-<br />
| [http://en.wikipedia.org/wiki/M-DISC M-DISC]<br />
| <br />
| <br />
| <br />
| <br />
| Unproven technology, but potentially interesting.<br />
|-<br />
| Flash media<br />
| <br />
| <br />
| <br />
| <br />
| Wears out quickly; not good for long-term storage. Perhaps solicit donations of old flash media from people, or sponsorship from flash companies?<br />
|-<br />
| Glass/metal etching<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Amazon Glacier<br />
| $122.88 (storage only, retrieval billed separately)<br />
| <br />
| average annual durability of 99.999999999% <ref>"Amazon Glacier is designed to provide average annual durability of 99.999999999% for an archive. The service redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives. Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing." Maciej Ceglowski thinks that's [https://blog.pinboard.in/2014/04/cloudy_snake_oil/ kinda bullshit compared to the failure events you don't plan for], of course.</ref><br />
| <br />
| Retrieval is billed separately. Retrieving up to 5% of stored data per month into S3 is free (5% of 100TB is 5TB), and data can be copied from S3 onto a SATA HD for $2.50/hr plus media handling and shipping fees. Downloading 5TB from S3 over the internet would cost $614.40 (~$122.88/TB), but only $44.82 to transfer to HD via USB 3 or SATA (USB 2, being slower, racks up more billable hours).<br />
|-<br />
| Dropbox for Business<br />
| $160* ($795/year)<br />
| <br />
| <br />
| <br />
| Dropbox for Business provides a shared pool of 1TB per user at $795/year (five-user minimum, 5TB), plus $125/year for each additional user.<br />
|-<br />
| Box.com for Business<br />
| $180* ("unlimited" storage for $900/year)<br />
| <br />
| <br />
| <br />
| Box.com for Business provides "unlimited" storage at $15/user/month with a five-user minimum, i.e. $900/year.<br />
|-<br />
| Dedicated colocated storage servers<br />
| $100* (e.g. $1300 for one year of 12TB rackmount server rental)<br />
|<br />
|<br />
|<br />
| Rent [http://www.ovh.com/us/dedicated-servers/storage/ storage servers from managed hosting colocation providers], and pool data across them. Benefits: bandwidth and electricity are included in the cost, and files can be made available online immediately. Negatives: someone has to administer tens of servers.<br />
|-<br />
| Tahoe-LAFS<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|}<br />
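<br />
The arithmetic behind several rows above (PaperBack, Blu-Ray, the Backblaze pod) is easy to sanity-check. A minimal sketch in Python - the prices are the 2014 figures quoted in the table, and the decimal-vs-binary terabyte convention is our assumption, not the table's:<br />
<br />
<pre><br />
# Back-of-envelope checks for the physical-media rows above (Python 3).<br />
TB = 10**12  # decimal terabyte; the table sometimes counts 2**40 bytes instead<br />
<br />
# PaperBack: 500KB per letter-size sheet<br />
sheets = TB / (500 * 10**3)       # 2,000,000 sheets (2,199,024 if TB means 2**40)<br />
reams = sheets / 500              # ~4,000 reams of 500 sheets each<br />
<br />
# Blu-Ray: $40 for a 50-disc spindle of 25GB BD-Rs<br />
discs_per_tb = TB / (25 * 10**9)  # 40 discs, i.e. less than one spindle per TB<br />
usd_per_gb = 40 / (50 * 25)       # $0.032/GB, matching "under $.04/GB" above<br />
<br />
# Backblaze pod: $11,000 up front for 180TB, non-redundant<br />
pod_usd_per_tb = 11000 / 180      # ~$61/TB, the table's "$62"<br />
<br />
print(f"PaperBack: {sheets:,.0f} sheets (~{reams:,.0f} reams) per TB")<br />
print(f"Blu-Ray: {discs_per_tb:.0f} discs per TB at ${usd_per_gb:.3f}/GB")<br />
print(f"Storage pod: ${pod_usd_per_tb:.0f}/TB up front")<br />
</pre><br />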
<br />
== Non-options ==<br />
* Ink-based Consumer Optical Media (CDs, DVD, etc.) <br />
** How is Blu-Ray different from DVD? DVDs do not last very long. The history of writable optical media has been one of chicanery, failure, and overpromising while under-delivering. Some DVDs failed within a year. There are claims Blu-Ray is different, but fool me 3,504 times, shame on me.<br />
* BitTorrent Sync<br />
** Proprietary (currently), so not a good idea to use as an archival format/platform<br />
* Amazon S3 / Google Cloud Storage / Microsoft Azure Storage<br />
** Amazon S3 might be a viable waypoint for intra-month storage ($30.68/TB/month), but retrieval over the internet, as with Glacier, is expensive: $8,499.08 for 100TB (see the sketch after this list). Google's and Microsoft's offerings are in the same price range.<br />
* Floppies<br />
** ''"Because 1.4 trillion floppies exists less than 700 billion floppies. HYPOTHETICALLY, if you set twenty stacks side by side, figure a quarter centimeter per floppy thickness, excluded the size of the drive needed to read the floppies you would still need a structure 175,000 ft. high to house them. Let's also assume that the failure rate for floppies is about 5% (everyone knows that varies by brand, usage, time of manufacture, materials used, etc, but lets say 5% per year). 70 million of those 1.4 trillion floppies are unusuable. Figuring 1.4 MB per floppy disk, you are losing approximately 100MB of porn each year. Assuming it takes 5 seconds to replace a bad floppy, you would have to spend 97,222 hrs/yr to replace them. Considering there are only 8,760 hrs per year, you would require a staff of 12 people replacing floppies around the clock or 24 people on 12 hr shifts. Figuring $7/hr you would spend $367,920 on labor alone. Figuring a nickel per bad floppy, you would need $3,500,000 annually in floppy disks, bringing your 1TB floppy raid operating costs (excluding electricity, etc) to $3,867, 920 and a whole landfill of corrupted porn. Thank you for destroying the planet and bankrupting a small country with your floppy based porn RAID."'' ([http://gizmodo.com/5431497/why-its-better-to-pretend-you-dont-know-anything-about-computers?comment=17793028#comments source])<br />
<br />
== From IRC ==<br />
<br />
<Drevkevac> we are looking to store 100TB+ of media offline for 25+ years<br />
<Drevkevac> if anyone wants to drop in, I will pastebin the chat log<br />
<rat> DVDR and BR-R are not high volume. When you have massive amounts of data, raid arrays have too many points of failure.<br />
<rat> Drevkevac: I work in a tv studio. We have 30+ years worth of tapes. And all of them are still good.<br />
<rat> find a hard drive from 30 years ago and see how well it hooks up ;)<br />
<brousch_> 1500 Taiyo Yuden Gold CD-Rs http://www.mediasupply.com/taiyo-yuden-gold-cd-rs.html<br />
<br />
<Drevkevac> still, if its true, you could do, perhaps, raidz3s in groups of 15 disks or so?<br />
<SketchCow> Please add paperbak to the wiki page.<br />
<SketchCow> Fuck Optical Media. not an option;.<br />
<Drevkevac> that would give you ~300GB per disk group, with 3 disks<br />
<br />
== Where are you going to put it? ==<br />
<br />
Okay, so you have the tech. Now you need a place for it to live.<br />
<br />
Possibilities:<br />
<br />
* The Internet Archive Physical Warehouse, Richmond, CA<br />
** The Internet Archive has several physical storage facilities, including warehouses in Richmond, CA (home of the Physical Archive) and the main location in San Francisco, CA. They have indicated they are willing to take copies of Archive Team-sponsored physical materials, with the intent of ingesting them into the Archive at large over time, as costs drop and 100TB collections are not as big a drain (or a rush of funding arrives from elsewhere).<br />
<br />
* Living Computer Museum, Seattle, WA<br />
** In discussions with Jason Scott, the Living Computer Museum has indicated it will have physical storage available for computer-historical materials. Depending on the items being saved by Archive Team, they may be willing to host/hold copies for the foreseeable future.<br />
<br />
* Library of Congress, Washington, DC<br />
** The Library of Congress may be willing to take a donation of physical storage, although it has not indicated what it would do with it in the long term.<br />
<br />
Multiple copies would of course be great.<br />
<br />
== Project-specific suggestions ==<br />
<br />
=== Twitch.tv (and other video services) ===<br />
<br />
* Keep the original video files in (semi-)offline storage, and store transcoded (compressed) versions on the Internet Archive.<br />
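<br />
A minimal sketch of that workflow, assuming a local directory of FLV originals - the directory layout and the ffmpeg settings here are illustrative, not a project standard:<br />
<br />
<pre><br />
# Transcode masters into "postage stamp" copies for upload (Python 3; needs ffmpeg).<br />
import pathlib<br />
import subprocess<br />
<br />
ORIGINALS = pathlib.Path("originals")    # hypothetical: master copies, kept semi-offline<br />
COMPRESSED = pathlib.Path("compressed")  # hypothetical: small versions for the IA<br />
COMPRESSED.mkdir(exist_ok=True)<br />
<br />
for src in ORIGINALS.glob("*.flv"):<br />
    dst = COMPRESSED / (src.stem + ".mp4")<br />
    subprocess.check_call([<br />
        "ffmpeg", "-i", str(src),<br />
        "-vf", "scale=320:-2",            # shrink to 320px wide, keep aspect ratio<br />
        "-c:v", "libx264", "-crf", "35",  # aggressive but watchable compression<br />
        "-c:a", "aac", "-b:a", "48k",     # older ffmpeg builds may need "-strict -2"<br />
        str(dst),<br />
    ])<br />
</pre><br />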
<br />
== See Also ==<br />
*[[Storage Media]]<br />
<br />
== References ==<br />
<references/><br />
<br />
{{Navigation box}}</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Valhalla&diff=20148Valhalla2014-09-19T03:45:57Z<p>Yipdw: </p>
<hr />
<div>This wiki page is a collection of ideas for Project '''Valhalla'''.<br />
<br />
<SketchCow> Basically, we have this situation where we have stuff that is being threatened,<br />
and it's huge, and then it's either not so threatened or it's in a weird quantum state.<br />
<SketchCow> So, this really stretches the bounds of what IA does. It's a huge amount of data, it's not likely to be <br />
overly touched if the originals are up, and IA will spend/lose a lot of money pulling it into their infrastructure.<br />
<SketchCow> So maybe we can discuss actual, not pie-in-the-sky possibilities of what we can do to have <br />
some sort of not-IA pile of storage.<br />
<SketchCow> Since I like naming things, I'd call it the Valhalla Option<br />
<SketchCow> Some sort of idea of Happy Hunting Grounds where we have these actual backups of things.<br />
<br />
Join the discussion in [irc://irc.efnet.org/huntinggrounds #huntinggrounds].<br />
<br />
== Options ==<br />
{| class="wikitable sortable"<br />
! Storage type<br />
! Cost ($/TB/year)<br />
! Storage density (m³/TB)<br />
! Theoretical lifespan<br />
! Practical, tested lifespan<br />
! Notes<br />
|-<br />
| Hard drives<br />
| <br />
| <br />
| <br />
| <br />
| These would have to be live. HDDs decay quickly, and if they're not spinning, you can't detect failures. Possible software for this kind of thing: syncthing, Tahoe-LAFS, ...?<br />
|-<br />
| Commercial / archival-grade tapes<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Consumer tape systems (VHS, Betamax, cassette tapes, ...)<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Vinyl<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| [http://www.ollydbg.de/Paperbak/index.html PaperBack]<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| [http://ronja.twibright.com/optar/ Optar]<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Blu-Ray<br />
| $40<br />
| <br />
| 30 years<ref>On the basis of the described studies and assuming adequate consideration of the specified conditions for storage and handling, as well as verification of data after writing, we estimate the Imation CD, DVD or Blu-ray media to have a theoretical readability of up to 30 years. The primary caveat is how you handle and store the media. http://support.tdkperformance.com/app/answers/detail/a_id/1685/~/life-expectancy-of-optical-media </ref><br />
| <br />
| Lasts a LOT longer than CD/DVD, but should not be assumed to last more than a decade. [http://arstechnica.com/information-technology/2014/01/why-facebook-thinks-blu-ray-discs-are-perfect-for-the-data-center/ Facebook uses Blu-ray discs for data-center cold storage]; raidz3-style backups in groups of 15 discs have been suggested. Comes to under $0.04/GB, which is cheap, with a low initial investment (drives) too!<br />
|-<br />
| [http://en.wikipedia.org/wiki/M-DISC M-DISC]<br />
| <br />
| <br />
| <br />
| <br />
| Unproven technology, but potentially interesting.<br />
|-<br />
| Flash media<br />
| <br />
| <br />
| <br />
| <br />
| Wears out quickly; not good for long-term storage. Perhaps solicit donations of old flash media from people, or sponsorship from flash companies?<br />
|-<br />
| Glass/metal etching<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|}<br />
<br />
== Non-options ==<br />
* Ink-based Consumer Optical Media (CDs, DVD, etc.) <br />
** Differences between Blu-Ray and DVD? DVDs do not last very long.<br />
* BitTorrent Sync<br />
** Proprietary (currently), so not a good idea to use as an archival format/platform<br />
* Amazon Glacier<br />
** Amazon Glacier seems like a great idea, until you realize they mean 1 cent per gigabyte per month. This is $120 per terabyte per year. The transfer out of 100TB would also run over $10,000 in the month it's pulled from the system.<br />
* Floppies<br />
** ''"Because 1.4 trillion floppies exists less than 700 billion floppies. HYPOTHETICALLY, if you set twenty stacks side by side, figure a quarter centimeter per floppy thickness, excluded the size of the drive needed to read the floppies you would still need a structure 175,000 ft. high to house them. Let's also assume that the failure rate for floppies is about 5% (everyone knows that varies by brand, usage, time of manufacture, materials used, etc, but lets say 5% per year). 70 million of those 1.4 trillion floppies are unusuable. Figuring 1.4 MB per floppy disk, you are losing approximately 100MB of porn each year. Assuming it takes 5 seconds to replace a bad floppy, you would have to spend 97,222 hrs/yr to replace them. Considering there are only 8,760 hrs per year, you would require a staff of 12 people replacing floppies around the clock or 24 people on 12 hr shifts. Figuring $7/hr you would spend $367,920 on labor alone. Figuring a nickel per bad floppy, you would need $3,500,000 annually in floppy disks, bringing your 1TB floppy raid operating costs (excluding electricity, etc) to $3,867, 920 and a whole landfill of corrupted porn. Thank you for destroying the planet and bankrupting a small country with your floppy based porn RAID."'' ([http://gizmodo.com/5431497/why-its-better-to-pretend-you-dont-know-anything-about-computers?comment=17793028#comments source])<br />
<br />
== From IRC ==<br />
<br />
<Drevkevac> we are looking to store 100TB+ of media offline for 25+ years<br />
<Drevkevac> if anyone wants to drop in, I will pastebin the chat log<br />
<rat> DVDR and BR-R are not high volume. When you have massive amounts of data, raid arrays have too many points of failure.<br />
<rat> Drevkevac: I work in a tv studio. We have 30+ years worth of tapes. And all of them are still good.<br />
<rat> find a hard drive from 30 years ago and see how well it hooks up ;)<br />
<brousch_> 1500 Taiyo Yuden Gold CD-Rs http://www.mediasupply.com/taiyo-yuden-gold-cd-rs.html<br />
<br />
<Drevkevac> still, if its true, you could do, perhaps, raidz3s in groups of 15 disks or so?<br />
<SketchCow> Please add paperbak to the wiki page.<br />
<SketchCow> Fuck Optical Media. not an option;.<br />
<Drevkevac> that would give you ~300GB per disk group, with 3 disks<br />
<br />
== Where are you going to put it? ==<br />
<br />
Okay, so you have the tech. Now you need a place for it to live.<br />
<br />
Possibilities:<br />
<br />
* IA, as usual<br />
* Living Computer Museum, Seattle, WA<br />
<br />
Multiple copies would of course be great.<br />
<br />
== See Also ==<br />
*[[Storage Media]]<br />
<br />
== References ==<br />
<references/><br />
<br />
{{Navigation box}}</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Valhalla&diff=20147Valhalla2014-09-19T03:45:46Z<p>Yipdw: </p>
<hr />
<div>This wiki page is a collection of ideas for Project '''Valhalla'''.<br />
<br />
<SketchCow> Basically, we have this situation where we have stuff that is being threatened,<br />
and it's huge, and then it's either not so threatened or it's in a weird quantum state.<br />
<SketchCow> So, this really stretches the bounds of what IA does. It's a huge amount of data, it's not likely to be <br />
overly touched if the originals are up, and IA will spend/lose a lot of money pulling it into their infrastructure.<br />
<SketchCow> So maybe we can discuss actual, not pie-in-the-sky possibilities of what we can do to have <br />
some sort of not-IA pile of storage.<br />
<SketchCow> Since I like naming things, I'd call it the Valhalla Option<br />
<SketchCow> Some sort of idea of Happy Hunting Grounds where we have these actual backups of things.<br />
<br />
Join the discussion in [irc://irc.efnet.org/huntinggrounds #huntinggrounds].<br />
<br />
== Options ==<br />
{| class="wikitable sortable"<br />
! Storage type<br />
! Cost ($/TB/year)<br />
! Storage density (m³/TB)<br />
! Theoretical lifespan<br />
! Practical, tested lifespan<br />
! Notes<br />
|-<br />
| Hard drives<br />
| <br />
| <br />
| <br />
| <br />
| These would have to be live. HDDs decay quickly, and if they're not spinning, you can't detect failures. Possible software for this kind of thing: syncthing, Tahoe-LAFS, ...?<br />
|-<br />
| Commercial / archival-grade tapes<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Consumer tape systems (VHS, Betamax, cassette tapes, ...)<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Vinyl<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| [http://www.ollydbg.de/Paperbak/index.html PaperBack]<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| [http://ronja.twibright.com/optar/ Optar]<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|-<br />
| Blu-Ray<br />
| $40<br />
| <br />
| 30 years<ref>On the basis of the described studies and assuming adequate consideration of the specified conditions for storage and handling, as well as verification of data after writing, we estimate the Imation CD, DVD or Blu-ray media to have a theoretical readability of up to 30 years. The primary caveat is how you handle and store the media. http://support.tdkperformance.com/app/answers/detail/a_id/1685/~/life-expectancy-of-optical-media </ref><br />
| <br />
| Lasts a LOT longer than CD/DVD, but should not be assumed to last more than a decade. [http://arstechnica.com/information-technology/2014/01/why-facebook-thinks-blu-ray-discs-are-perfect-for-the-data-center/ Facebook uses Blu-ray discs for data-center cold storage]; raidz3-style backups in groups of 15 discs have been suggested. Comes to under $0.04/GB, which is cheap, with a low initial investment (drives) too!<br />
|-<br />
| [http://en.wikipedia.org/wiki/M-DISC M-DISC]<br />
| <br />
| <br />
| <br />
| <br />
| Unproven technology, but potentially interesting.<br />
|-<br />
| Flash media<br />
| <br />
| <br />
| <br />
| <br />
| Wears out quickly; not good for long-term storage. Perhaps solicit donations of old flash media from people, or sponsorship from flash companies?<br />
|-<br />
| Glass/metal etching<br />
| <br />
| <br />
| <br />
| <br />
| <br />
|}<br />
<br />
== Non-options ==<br />
* Ink-based Consumer Optical Media (CDs, DVD, etc.) <br />
** Differences between Blu-Ray and DVD? DVDs do not last very long.<br />
* BitTorrent Sync<br />
** Proprietary (currently), so not a good idea to use as an archival format/platform<br />
* Amazon Glacier<br />
** Amazon Glacier seems like a great idea, until you realize they mean 1 cent per gigabyte per month. This is $120 per terabyte per year. The transfer out of 100TB would also run over $10,000 in the month it's pulled from the system.<br />
* Floppies<br />
** ''"Because 1.4 trillion floppies exists less than 700 billion floppies. HYPOTHETICALLY, if you set twenty stacks side by side, figure a quarter centimeter per floppy thickness, excluded the size of the drive needed to read the floppies you would still need a structure 175,000 ft. high to house them. Let's also assume that the failure rate for floppies is about 5% (everyone knows that varies by brand, usage, time of manufacture, materials used, etc, but lets say 5% per year). 70 million of those 1.4 trillion floppies are unusuable. Figuring 1.4 MB per floppy disk, you are losing approximately 100MB of porn each year. Assuming it takes 5 seconds to replace a bad floppy, you would have to spend 97,222 hrs/yr to replace them. Considering there are only 8,760 hrs per year, you would require a staff of 12 people replacing floppies around the clock or 24 people on 12 hr shifts. Figuring $7/hr you would spend $367,920 on labor alone. Figuring a nickel per bad floppy, you would need $3,500,000 annually in floppy disks, bringing your 1TB floppy raid operating costs (excluding electricity, etc) to $3,867, 920 and a whole landfill of corrupted porn. Thank you for destroying the planet and bankrupting a small country with your floppy based porn RAID."'' ([http://gizmodo.com/5431497/why-its-better-to-pretend-you-dont-know-anything-about-computers?comment=17793028#comments source])<br />
<br />
== From IRC ==<br />
<br />
<Drevkevac> we are looking to store 100TB+ of media offline for 25+ years<br />
<Drevkevac> if anyone wants to drop in, I will pastebin the chat log<br />
<rat> DVDR and BR-R are not high volume. When you have massive amounts of data, raid arrays have too many points of failure.<br />
<rat> Drevkevac: I work in a tv studio. We have 30+ years worth of tapes. And all of them are still good.<br />
<rat> find a hard drive from 30 years ago and see how well it hooks up ;)<br />
<brousch_> 1500 Taiyo Yuden Gold CD-Rs http://www.mediasupply.com/taiyo-yuden-gold-cd-rs.html<br />
<br />
<Drevkevac> still, if its true, you could do, perhaps, raidz3s in groups of 15 disks or so?<br />
<SketchCow> Please add paperbak to the wiki page.<br />
<SketchCow> Fuck Optical Media. not an option;.<br />
<Drevkevac> that would give you ~300GB per disk group, with 3 disks<br />
<br />
== Where are you going to put it? ==<br />
<br />
Okay, so you have the tech. Now you need a place for it to live.<br />
<br />
Possibilities:<br />
<br />
* IA, as usual<br />
* Living Computer Museum, Seattle, WA<br />
<br />
Multiple copies would of course be great.<br />
<br />
== See Also ==<br />
*[[Storage Media]]<br />
<br />
== References ==<br />
<references/><br />
<br />
{{Navigation box}}</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Valhalla&diff=20105Valhalla2014-09-18T22:18:18Z<p>Yipdw: Undo revision 20103 by Yipdw (talk) (stupid idea)</p>
<hr />
<div>This wiki page is a collection of ideas for Project '''Valhalla'''.<br />
<br />
<SketchCow> Basically, we have this situation where we have stuff that is being threatened,<br />
and it's huge, and then it's either not so threatened or it's in a weird quantum state.<br />
So, this really stretches the bounds of what IA does. It's a huge amount of data, it's not likely <br />
to be overly touched if the originals are up, and IA will spend/lose a lot of money pulling it into their infrastructure.<br />
So maybe we can discuss actual, not pie-in-the-sky possibilities of what we can do to have some sort of not-IA pile of storage.<br />
<br />
== Options ==<br />
* Commercial/archival-grade tapes<br />
* Consumer tape systems (VHS, Betamax, cassette tapes, ...)<br />
* [http://www.ollydbg.de/Paperbak/index.html PaperBack]<br />
* [http://ronja.twibright.com/optar/ Optar]<br />
* Blu-ray: lasts a LOT longer than CD/DVD but should not be assumed to last more than a decade<br />
* [http://en.wikipedia.org/wiki/M-DISC M-DISC]: Unproven technology, but potentially interesting.<br />
* Flash media<br />
** Wears out quickly; not good for long-term storage<br />
** Soliciting donations for old flash media from people, or sponsorship from flash companies?<br />
<br />
== Non-options ==<br />
* BitTorrent Sync<br />
** Proprietary (currently), so not a good idea to use as an archival format/platform<br />
* Amazon Glacier<br />
** Amazon Glacier seems like a great idea, until you realize they mean 1 cent per gigabyte per month. This is $120 per terabyte per year. The transfer out of 100TB would also run over $10,000 in the month it's pulled from the system.<br />
* Ink-based Consumer Optical Media (CDs, DVD, etc.)<br />
** Differences between Blu-Ray and DVD? DVDs do not last very long.<br />
<br />
{{Navigation box}}</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Valhalla&diff=20103Valhalla2014-09-18T22:17:35Z<p>Yipdw: /* Options */</p>
<hr />
<div>This wiki page is a collection of ideas for Project '''Valhalla'''.<br />
<br />
<SketchCow> Basically, we have this situation where we have stuff that is being threatened,<br />
and it's huge, and then it's either not so threatened or it's in a weird quantum state.<br />
So, this really stretches the bounds of what IA does. It's a huge amount of data, it's not likely <br />
to be overly touched if the originals are up, and IA will spend/lose a lot of money pulling it into their infrastructure.<br />
So maybe we can discuss actual, not pie-in-the-sky possibilities of what we can do to have some sort of not-IA pile of storage.<br />
<br />
== Options ==<br />
* Tapes<br />
* [http://www.ollydbg.de/Paperbak/index.html PaperBack]<br />
* [http://ronja.twibright.com/optar/ Optar]<br />
* Blu-ray: lasts a LOT longer than CD/DVD but should not be assumed to last more than a decade<br />
* [http://en.wikipedia.org/wiki/M-DISC M-DISC]: Unproven technology, but potentially interesting.<br />
* Flash media<br />
** Wears out quickly; not good for long-term storage<br />
** Soliciting donations for old flash media from people, or sponsorship from flash companies?<br />
* Periodically launching hard drives into space; has side effect of generating revenue for SpaceX<br />
<br />
== Non-options ==<br />
* BitTorrent Sync<br />
** Proprietary (currently), so not a good idea to use as an archival format/platform<br />
* Amazon Glacier<br />
** Amazon Glacier seems like a great idea, until you realize they mean 1 cent per gigabyte per month. This is $120 per terabyte per year. The transfer out of 100TB would also run over $10,000 in the month it's pulled from the system.<br />
* Ink-based Consumer Optical Media (CDs, DVD, etc.)<br />
** Differences between Blu-Ray and DVD? DVDs do not last very long.<br />
<br />
{{Navigation box}}</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Twitch.tv&diff=19555Twitch.tv2014-08-10T17:48:31Z<p>Yipdw: </p>
<hr />
<div>{{Infobox project<br />
| title = Twitch.tv<br />
| URL = http://twitch.tv<br />
| image = Twitch_homepage_screenshot.png<br />
| logo = Twitch_Logo.png<br />
| project_status = '''Special Case''' (Archives to be deleted)<br />
| archiving_status = {{inprogress}}<br />
| irc = burnthetwitch<br />
| source = [https://github.com/ArchiveTeam/twitchtv-discovery-grab Phase 1],[https://github.com/ArchiveTeam/twitchtv-grab Phase 2], [https://github.com/ArchiveTeam/twitchtv-items Items]<br />
| tracker = [http://tracker.archiveteam.org/twitchdisco/ Phase 1], [http://tracker.archiveteam.org/twitchtv/ Phase 2]<br />
}}<br />
<br />
Justin.tv—sorry, ''cough'', I mean to say—'''Twitch.tv''' is a live video streaming service.<br />
<br />
== Shutdown ==<br />
<br />
<blockquote><br />
<br />
<p>'''Changes To VODs On Twitch'''</p><br />
<p>Aug 06 2014 · Engineering, Tech</p><br />
<br />
<p>Our goal at Twitch is straightforward: deliver the highest quality video. This includes the ability to watch video on demand (VOD) on all of our platforms, not just the website.</p><br />
<br />
<p>In order to create a system that supports live and VOD across the globe and on multiple platforms, we need to make significant changes to the way we’re currently storing video. Today, we’d like to discuss what these changes are, why they’re necessary, and how they benefit the entire Twitch community now and in the future.</p><br />
<br />
<p>[...]</p><br />
<br />
<p>''Looking at Viewership Data''</p><br />
<br />
<p>We found that the vast majority of past broadcast views happen within the first two weeks after they’re created. On the days following, viewership reduces exponentially.</p><br />
<br />
<p><br />
We also discovered that 80% of our storage capacity is filled with past broadcasts that are never watched. That’s multiple petabytes for video that no one has ever viewed.</p><br />
<br />
<p>Highlights, on the other hand, have much more value and longevity. Over their lifetime, highlights get 9x as many views as past broadcasts.</p><br />
<br />
<p>[...]</p><br />
<br />
<p>As for existing past broadcasts, '''beginning three weeks from today, we will begin removing them from Twitch servers'''. If you would like to keep your past broadcasts, we encourage you to begin exporting or making highlights of your best moments so that they’re saved for posterity.</p><br />
<br />
<p>[...]<ref>http://blog.twitch.tv/2014/08/update-changes-to-vods-on-twitch/</ref></p><br />
<br />
</blockquote><br />
<br />
== Site structure ==<br />
<br />
* HTML page requests: http://secure.twitch.tv/swflibs/TwitchPlayer.swf?videoId=a387099879<br />
* Flash requests: https://api.twitch.tv/api/videos/a387099879?as3=t<br />
* You can just type it directly as well: http://www.twitch.tv/twitchplayspokemon/b/503249758 → https://api.twitch.tv/api/videos/a503249758?as3=t<br />
* There's also this: https://api.justin.tv/api/broadcast/by_archive/503249758.json?onsite=true<br />
* The JSON file contains a list of URLs to the FLV files (see the sketch after this list).<br />
* Highlights: https://api.twitch.tv/api/videos/c2673085?as3=t (notice the start and end offsets)<br />
* http://www.twitchtools.com/video-download.php provides the above service<br />
* <code>youtube-dl -i</code> appears to do some of them<br />
* Scraping: https://api.twitch.tv/kraken/videos/top?limit=20&offset=0&period=all<br />
* Are there any irregularities? Differences between highlights and past broadcasts?<br />
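<br />
A minimal sketch of pulling the FLV list from the API endpoint above (Python 3). The endpoint is the one listed in this section; since the JSON schema isn't documented here, the sketch makes no assumptions about field names and simply walks the parsed tree for anything that looks like an FLV URL:<br />
<br />
<pre><br />
import json<br />
import urllib.request<br />
<br />
def flv_urls(obj):<br />
    """Recursively yield any string mentioning '.flv' from parsed JSON."""<br />
    if isinstance(obj, dict):<br />
        for value in obj.values():<br />
            yield from flv_urls(value)<br />
    elif isinstance(obj, list):<br />
        for value in obj:<br />
            yield from flv_urls(value)<br />
    elif isinstance(obj, str) and ".flv" in obj:<br />
        yield obj<br />
<br />
video_id = "a503249758"  # ID format taken from the examples above<br />
url = "https://api.twitch.tv/api/videos/%s?as3=t" % video_id<br />
with urllib.request.urlopen(url) as resp:<br />
    data = json.loads(resp.read().decode("utf-8"))<br />
<br />
for flv in flv_urls(data):<br />
    print(flv)<br />
</pre><br />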
<br />
=== Storage Issues ===<br />
<br />
* How to decide which are important? 10+ views again? Do a discovery crawl first?<br />
* Tahoe-LAFS? Grab ''all'' the videos into temp storage?<br />
* Compress all the unwatched videos into postage stamp sized videos?<br />
<br />
== How can I help? ==<br />
<br />
Download and fire up your [[warrior]]! Then select Twitch Phase 2. Better yet, select Archive Team's Choice. <br />
<br />
Alternatively, advanced users can run the scripts manually. Details are described in the source code repos.<br />
<br />
Don't forget to '''[https://archive.org/donate/ donate to the Internet Archive]''', who will be hosting these files. Disk space is cheap, but maintaining it is not!<br />
<br />
=== For those not using the Warrior ===<br />
<br />
Please run these sysctl tweaks to optimize uploads:<br />
<br />
<pre><br />
# Add to /etc/sysctl.conf and run "sysctl -p"<br />
# increase TCP max buffer size settable using setsockopt()<br />
net.core.rmem_max = 16777216<br />
net.core.wmem_max = 16777216<br />
# increase Linux autotuning TCP buffer limit<br />
net.ipv4.tcp_rmem = 4096 87380 16777216<br />
net.ipv4.tcp_wmem = 4096 65536 16777216<br />
</pre><br />
<br />
You can also apply them without modifying /etc/sysctl.conf by running e.g. <pre>sysctl net.core.rmem_max=16777216 net.core.wmem_max=16777216</pre>, but be aware that settings issued this way won't persist across reboots.<br />
<br />
=== What we are saving ===<br />
<br />
Currently:<br />
<br />
* twitchplayspokemon<br />
<br />
Next:<br />
<br />
* Videos with X or more views<br />
<br />
Anything culturally significant to add? Comment on [[Talk:Twitch.tv]]. Don't forget to sign your comments with <code><nowiki>~~~~</nowiki></code>.<br />
<br />
== Archives ==<br />
<br />
=== By Archive Team ===<br />
<br />
TODO: Archives will be made available later as [[WARC]] files and will be accessible by the Wayback Machine. A searchable index will be made later.<br />
<br />
=== Renegade Stream Archives ===<br />
<br />
These archives are made in a manual fashion through the efforts of streaming communities. Feel free to expand this list.<br />
<br />
* [[Twitch.tv/Vinesauce|Vinesauce Stream Archival Effort]] - A crowdsourced effort by fans of the Vinesauce Group to archive 1714 of their streams.<br />
** [http://vinesauce.com/vinetalk/index.php?topic=4321.msg81672 Vinesauce Forum link]<br />
* [http://archive.klaxa.eu Klaxa.eu's Archive of The 4chan Cup] - An existing, complete archive of The 4chan Cup, starting from the 2014 Autumn Games up till today.<br />
<br />
== See Also == <br />
<br />
* [[Justin.tv]]<br />
<br />
== References ==<br />
<br />
<references/><br />
<br />
{{navigation box}}<br />
<br />
[[Category:Video hosting services]]</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=ArchiveBot&diff=19322ArchiveBot2014-07-23T05:40:59Z<p>Yipdw: </p>
<hr />
<div>[[File:Librarianmotoko.jpg|200px|right|thumb|Imagine Motoko Kusanagi as an archivist.]]<br />
<br />
'''ArchiveBot''' is an [[IRC]] bot designed to automate the archival of smaller websites (e.g. up to a few hundred thousand URLs). You give it a URL to start at, and it grabs all content under that URL, [[Wget_with_WARC_output|records it in a WARC]], and then uploads that WARC to ArchiveTeam servers for eventual injection into the [https://archive.org/search.php?query=collection%3Aarchivebot&sort=-publicdate Internet Archive] (or other archive sites).<br />
<br />
== Details ==<br />
<br />
To use ArchiveBot, drop by [http://chat.efnet.org:9090/?nick=&channels=%23archivebot&Login=Login '''#archivebot'''] on EFNet. To interact with ArchiveBot, you [https://raw2.github.com/ArchiveTeam/ArchiveBot/master/COMMANDS issue '''commands'''] by typing them into the channel. Note that you will need channel operator (<code>@</code>) or voice (<code>+</code>) permissions to issue archiving jobs; please ask for assistance or leave a message describing the website you want to archive. The [http://arshboard.at.ninjawedding.org:4567 '''dashboard'''] shows the sites currently being downloaded.<br />
<br />
Follow [https://twitter.com/atarchivebot @ATArchiveBot] on [[Twitter]]!<br />
<br />
=== Components ===<br />
<br />
IRC interface<br />
:The bot listens for commands and reports status back on the IRC channel. You can ask it to archive a website or webpage, check whether a URL has been saved, change the delay time between requests, or add ignore rules. The IRC interface is collaborative, meaning anyone with permission can adjust the parameters of jobs. Note that the bot isn't a chat bot, so it will ignore anything it doesn't understand.<br />
<br />
Dashboard<br />
:The dashboard displays the URLs being downloaded. Each URL line is categorized as a success, warning, or error, and highlighted in yellow or red accordingly. The dashboard also provides RSS feeds.<br />
<br />
Backend<br />
:The backend holds the database of jobs and runs several maintenance tasks, such as trimming logs and posting tweets on Twitter. The backend is the centralized portion of ArchiveBot.<br />
<br />
Crawler<br />
:The crawler downloads and spiders websites into WARC files. The crawler is the distributed portion of ArchiveBot: volunteers run nodes connected to the backend, and the backend tells the nodes which jobs to run. Once a node has finished, it reports back to the backend and uploads the WARC files to the staging server.<br />
<br />
Staging server<br />
:The staging server is the place where all the WARC files are uploaded temporarily. Once the current batch has been approved, it is uploaded to the Internet Archive for consumption by the Wayback Machine.<br />
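<br />
The flow between these components can be summarized in a short sketch. Every class and method name below is illustrative only - the real protocol lives in the source repository linked below:<br />
<br />
<pre><br />
# Self-contained sketch (Python 3) of the node/backend flow described above.<br />
class Backend:<br />
    """Stand-in for the centralized job database."""<br />
    def __init__(self, jobs):<br />
        self.jobs = list(jobs)<br />
    def claim_job(self):<br />
        return self.jobs.pop(0) if self.jobs else None<br />
    def report_finished(self, job_id):<br />
        print("job %s finished" % job_id)<br />
<br />
class Staging:<br />
    """Stand-in for the staging server that collects finished WARCs."""<br />
    def upload(self, path):<br />
        print("uploaded %s to staging" % path)<br />
<br />
def crawl_to_warc(url):<br />
    # Stand-in for the real crawler (wget recording into a WARC).<br />
    return url.replace("://", "_").replace("/", "_") + ".warc.gz"<br />
<br />
def run_node(backend, staging):<br />
    while True:<br />
        job = backend.claim_job()   # the backend decides what each node runs<br />
        if job is None:<br />
            break<br />
        staging.upload(crawl_to_warc(job["url"]))<br />
        backend.report_finished(job["id"])<br />
<br />
run_node(Backend([{"id": 1, "url": "http://example.com/"}]), Staging())<br />
</pre><br />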
<br />
ArchiveBot's source code can be found at https://github.com/ArchiveTeam/ArchiveBot. [[Dev|Contributions welcomed]]! Any issues or feature requests may be filed at [https://github.com/ArchiveTeam/ArchiveBot/issues the issue tracker]. <br />
<br />
=== People ===<br />
<br />
The IRC bot, backend, and dashboard are operated by [[User:yipdw|yipdw]]. The staging server is operated by [[User:jscott|SketchCow]]. The crawlers are operated by various people.<br />
<br />
== Volunteer a Node ==<br />
<br />
If you have a machine with <br />
<br />
* lots of disk space (40 GB minimum / 200 GB recommended / 500 GB atypical)<br />
* 512 MB RAM (2 GB recommended, 2 GB swap recommended)<br />
* 10 Mbps upload/download speeds (100 Mbps recommended)<br />
* long-term availability (2 months minimum)<br />
* unrestricted internet access (no firewalls/proxies/censorship)<br />
<br />
and would like to volunteer, please review the [https://github.com/ArchiveTeam/ArchiveBot/blob/master/INSTALL Pipeline Install] instructions and contact [[User:yipdw|yipdw]].<br />
<br />
== More ==<br />
<br />
Like ArchiveBot? Check out our [[Main_Page|homepage]] and other [[projects]]!<br />
<br />
{{navigation_box}}</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=User:Yipdw&diff=19300User:Yipdw2014-07-20T07:36:16Z<p>Yipdw: </p>
<hr />
<div>Because I am listed as [[ArchiveBot]]'s primary contact, here are some ways to get in contact with me:<br />
<br />
{| width=100%<br />
| | '''IRC''' || yipdw on EFnet; usually in #archiveteam, #archiveteam-bs, #archivebot<br />
|-<br />
| | '''XMPP''' (Jabber, Google Talk) || yipdw@member.fsf.org<br />
|-<br />
| | '''Email''' || yipdw@member.fsf.org<br />
|}<br />
<br />
I can use OTR on the XMPP account if you'd like. Current fingerprint is 26E08144 8F752A1D E3683DA9 8A4EED08 C37ACA10.<br />
<br />
I can use PGP over email if you'd like. My [http://pgp.mit.edu/pks/lookup?op=get&search=0xA0E1B064735CC527 current PGP public key] expires 2015-08-25 and has fingerprint D47B 2BA8 4770 C5F8 62D6 D881 A0E1 B064 735C C527.</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=User:Yipdw&diff=19299User:Yipdw2014-07-20T07:34:59Z<p>Yipdw: </p>
<hr />
<div>Because I am listed as [[ArchiveBot]]'s primary contact, here are some ways to get in contact with me:<br />
<br />
{| width=100%<br />
| | '''IRC''' || yipdw on EFnet; usually in #archiveteam, #archiveteam-bs, #archivebot<br />
|-<br />
| | '''XMPP''' (Jabber, Google Talk) || yipdw@member.fsf.org<br />
|-<br />
| | '''Email''' || yipdw@member.fsf.org<br />
|}<br />
<br />
I can use OTR on the XMPP account if you'd like. Current fingerprint is 26E08144 8F752A1D E3683DA9 8A4EED08 C37ACA10.<br />
<br />
I can use GPG over email if you'd like. My [http://pgp.mit.edu/pks/lookup?op=get&search=0xA0E1B064735CC527 current GPG public key] expires 2015-08-25 and has fingerprint D47B 2BA8 4770 C5F8 62D6 D881 A0E1 B064 735C C527.</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=User:Yipdw&diff=19298User:Yipdw2014-07-20T07:34:30Z<p>Yipdw: </p>
<hr />
<div>Because I am listed as [[ArchiveBot]]'s primary contact, here are some ways to get in contact with me:<br />
<br />
{| width=100%<br />
| | '''IRC''' || yipdw on EFnet; usually in #archiveteam, #archiveteam-bs, #archivebot<br />
|-<br />
| | '''XMPP''' (Jabber, Google Talk) || yipdw@member.fsf.org<br />
|-<br />
| | '''Email''' || yipdw@member.fsf.org<br />
|}<br />
<br />
I can use OTR on the XMPP account if you'd like. Current fingerprint is 26E08144 8F752A1D E3683DA9 8A4EED08 C37ACA10.<br />
<br />
I can use GPG over email if you'd like. My [http://pgp.mit.edu/pks/lookup?op=get&search=0xA0E1B064735CC527 current GPG public key] has fingerprint D47B 2BA8 4770 C5F8 62D6 D881 A0E1 B064 735C C527.</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=User:Yipdw&diff=19297User:Yipdw2014-07-20T07:34:11Z<p>Yipdw: </p>
<hr />
<div>Because I am listed as [[ArchiveBot]]'s primary contact, here are some ways to get in contact with me:<br />
<br />
{| width=100%<br />
| | '''IRC''' || yipdw on EFnet; usually in #archiveteam, #archiveteam-bs, #archivebot<br />
|-<br />
| | '''XMPP''' (Jabber, Google Talk) || yipdw@member.fsf.org<br />
|-<br />
| | '''Email''' || yipdw@member.fsf.org<br />
|}<br />
<br />
I can use OTR on the XMPP account if you'd like. Current fingerprint is 26E08144 8F752A1D E3683DA9 8A4EED08 C37ACA10.<br />
<br />
I can use GPG over email if you'd like. My [http://pgp.mit.edu/pks/lookup?op=get&search=0xA0E1B064735CC527 GPG public key] has key ID 735CC527 and fingerprint D47B 2BA8 4770 C5F8 62D6 D881 A0E1 B064 735C C527.</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=User:Yipdw&diff=19296User:Yipdw2014-07-20T07:28:59Z<p>Yipdw: </p>
<hr />
<div>Because I am listed as [[ArchiveBot]]'s primary contact, here are some ways to get in contact with me:<br />
<br />
{| width=100%<br />
| | '''IRC''' || yipdw on EFnet; usually in #archiveteam, #archiveteam-bs, #archivebot<br />
|-<br />
| | '''XMPP''' (Jabber, Google Talk) || yipdw@member.fsf.org<br />
|}<br />
<br />
I can use OTR on the XMPP account if you'd like. Current fingerprint is 26E08144 8F752A1D E3683DA9 8A4EED08 C37ACA10.</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=ArchiveBot&diff=19295ArchiveBot2014-07-20T07:27:56Z<p>Yipdw: </p>
<hr />
<div>[[File:Librarianmotoko.jpg|200px|right|thumb|Imagine Motoko Kusanagi as an archivist.]]<br />
<br />
'''ArchiveBot''' is an [[IRC]] bot designed to automate the archival of smaller websites (e.g. up to a few hundred thousand URLs). You give it a URL to start at, and it grabs all content under that URL, [[Wget_with_WARC_output|records it in a WARC]], and then uploads that WARC to ArchiveTeam servers for eventual injection into the [https://archive.org/search.php?query=collection%3Aarchivebot&sort=-publicdate Internet Archive] (or other archive sites).<br />
<br />
== Details ==<br />
<br />
To use ArchiveBot, drop by [http://chat.efnet.org:9090/?nick=&channels=%23archivebot&Login=Login '''#archivebot'''] on EFNet. To interact with ArchiveBot, you [https://raw2.github.com/ArchiveTeam/ArchiveBot/master/COMMANDS issue '''commands'''] by typing them into the channel. Note that you will need channel operator (<code>@</code>) or voice (<code>+</code>) permissions to issue archiving jobs; please ask for assistance or leave a message describing the website you want to archive. The [http://archivebot.at.ninjawedding.org:4567 '''dashboard'''] shows the sites currently being downloaded.<br />
<br />
Follow [https://twitter.com/atarchivebot @ATArchiveBot] on [[Twitter]]!<br />
<br />
=== Components ===<br />
<br />
IRC interface<br />
:The bot listens for commands and reports status back on the IRC channel. You can ask it to archive a website or webpage, check whether a URL has been saved, change the delay time between requests, or add ignore rules. The IRC interface is collaborative, meaning anyone with permission can adjust the parameters of jobs. Note that the bot isn't a chat bot, so it will ignore anything it doesn't understand.<br />
<br />
Dashboard<br />
:The dashboard displays the URLs being downloaded. Each URL line is categorized as a success, warning, or error, and highlighted in yellow or red accordingly. The dashboard also provides RSS feeds.<br />
<br />
Backend<br />
:The backend holds the database of jobs and runs several maintenance tasks, such as trimming logs and posting tweets on Twitter. The backend is the centralized portion of ArchiveBot.<br />
<br />
Crawler<br />
:The crawler downloads and spiders websites into WARC files. The crawler is the distributed portion of ArchiveBot: volunteers run nodes connected to the backend, and the backend tells the nodes which jobs to run. Once a node has finished, it reports back to the backend and uploads the WARC files to the staging server.<br />
<br />
Staging server<br />
:The staging server is the place where all the WARC files are uploaded temporarily. Once the current batch has been approved, it is uploaded to the Internet Archive for consumption by the Wayback Machine.<br />
<br />
ArchiveBot's source code can be found at https://github.com/ArchiveTeam/ArchiveBot. [[Dev|Contributions welcomed]]! Any issues or feature requests may be filed at [https://github.com/ArchiveTeam/ArchiveBot/issues the issue tracker]. <br />
<br />
=== People ===<br />
<br />
The IRC bot, backend, and dashboard are operated by [[User:yipdw|yipdw]]. The staging server is operated by [[User:jscott|SketchCow]]. The crawlers are operated by various people.<br />
<br />
== Volunteer a Node ==<br />
<br />
If you have a machine with <br />
<br />
* lots of disk space (40 GB minimum / 200 GB recommended / 500 GB atypical)<br />
* 512 MB RAM (2 GB recommended, 2 GB swap recommended)<br />
* 10 Mbps upload/download speeds (100 Mbps recommended)<br />
* long-term availability (2 months minimum)<br />
* unrestricted internet access (no firewalls/proxies/censorship)<br />
<br />
and would like to volunteer, please review the [https://github.com/ArchiveTeam/ArchiveBot/blob/master/INSTALL Pipeline Install] instructions and contact [[User:yipdw|yipdw]].<br />
<br />
== More ==<br />
<br />
Like ArchiveBot? Check out our [[Main_Page|homepage]] and other [[projects]]!<br />
<br />
{{navigation_box}}</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=User:Yipdw&diff=19294User:Yipdw2014-07-20T07:26:18Z<p>Yipdw: </p>
<hr />
<div>Because I am listed as [[ArchiveBot]]'s primary contact, here are some ways to get in contact with me:<br />
<br />
{| width=100%<br />
| | '''IRC''' || yipdw @ irc.efnet.net on #archiveteam, #archiveteam-bs, #archivebot<br />
|-<br />
| | '''XMPP''' (Jabber, Google Talk) || yipdw@member.fsf.org<br />
|}<br />
<br />
I can use OTR on the XMPP account if you'd like. Current fingerprint is 26E08144 8F752A1D E3683DA9 8A4EED08 C37ACA10.</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=User:Yipdw&diff=19293User:Yipdw2014-07-20T07:23:51Z<p>Yipdw: </p>
<hr />
<div>Because I am listed as [[ArchiveBot]]'s primary contact, here are some ways to get in contact with me:<br />
<br />
{| width=100%<br />
| '''IRC''' || yipdw @ irc.efnet.net on #archiveteam, #archiveteam-bs, #archivebot<br />
|-<br />
| '''XMPP''' (Jabber, Google Talk) || yipdw@member.fsf.org<br />
|}<br />
<br />
I can use OTR on the XMPP account if you'd like. Current fingerprint is 26E08144 8F752A1D E3683DA9 8A4EED08 C37ACA10.</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=User:Yipdw&diff=19292User:Yipdw2014-07-20T07:21:35Z<p>Yipdw: Created page with "Because I am listed as ArchiveBot's primary contact, here are some ways to get in contact with me: IRC: yipdw @ irc.efnet.net on #archiveteam, #archiveteam-bs, #archivebo..."</p>
<hr />
<div>Because I am listed as [[ArchiveBot]]'s primary contact, here are some ways to get in contact with me:<br />
<br />
IRC: yipdw @ irc.efnet.net on #archiveteam, #archiveteam-bs, #archivebot<br />
XMPP (Jabber, Google Talk): yipdw@member.fsf.org<br />
<br />
I can use OTR on the XMPP account if you'd like. Current fingerprint is 26E08144 8F752A1D E3683DA9 8A4EED08 C37ACA10.</div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Template:Navigation_box&diff=19028Template:Navigation box2014-06-13T01:31:10Z<p>Yipdw: </p>
<hr />
<div><br clear="all" /><center><!--<br />
<br />
<br />
<br />
<br />
Rows are in Alphabetic order. Except "Current events" at the top and "About Archive Team" at the bottom.<br />
Items inside rows are in Alphabetic order too.<br />
Easy : )<br />
<br />
<br />
<br />
<br />
--><br />
{| class="mw-collapsible mw-collapsed" style="border: 1px solid #aaa; background-color: #f9f9f9; color: black; margin: 0.5em 0 0.5em 1em; padding: 0.2em; font-size: 100%;"<br />
| colspan=3 align=center style="background: #ccccff;" | <span style="float: right;"><span class="plainlinks">[[{{fullurl:Template:Navigation_box}} view]]&nbsp;&nbsp;[[{{fullurl:Template:Navigation_box|action=edit}} edit]]</span>&nbsp;</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'''[[Archive Team]]'''&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Current events]]''' || [[Alive... OR ARE THEY]] {{·}} [[Deathwatch]] {{·}} [[Projects]] {{·}} '''[[Archives|Download available archives]]''' || rowspan=5 | [[File:Archiveteam.jpg|right|150px]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Archiving projects]]''' || [[Archive.is]] {{·}} [[BetaArchive]] {{·}} [[Internet Archive]] {{·}} [[It Died]] {{·}} [[OldApps.com]] {{·}} [[OldVersion.com]] {{·}} [[OSBetaArchive]] {{·}} [[TEXTFILES]]<br>[[The Dead, the Dying & The Damned]] {{·}} [[UK Web Archive]] {{·}} [[WebCite]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Blogs/[[Web hostings]]''' || [[Angelfire]] {{·}} [[Blogger]] {{·}} [[Blogster]] {{·}} [[EtherPad]] {{·}} [[FortuneCity]] {{·}} [[Free ProHosting]] {{·}} [[Fuelmyblog]] {{·}} [[GeoCities]] ([[GeoCities Torrent Patch|patch]]) {{·}} [[Google Sites]] {{·}} [[Jux]] {{·}} [[LiveJournal]] {{·}} [[My Opera]] {{·}} [[Open Diary]] {{·}} [[Posterous]] {{·}} [[Prodigy.net]] {{·}} [[Proust]] {{·}} [[Splinder]] {{·}} [[Tripod]] {{·}} [[Vox]] {{·}} [[Windows Live Spaces]] {{·}} [[Wordpress.com]] {{·}} [[Xanga]] {{·}} [[Yahoo! Blog]] {{·}} [[Zapd]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[:Category:Corporations|Corporations]]''' || [[Apple]] {{·}} [[IBM]] {{·}} [[Google]] {{·}} [[Microsoft]] {{·}} [[Yahoo!]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Events''' || [[Arab Spring]] {{·}} [[Occupy movement]] {{·}} [[Spanish Revolution]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Font Repos''' || [[Google Web Fonts]] {{·}} [[GNU FreeFont]] {{·}} [[Fontspace]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Image hosting services]]''' || [[Cameroid]] {{·}} [[Flickr]] {{·}} [[Geograph Britain and Ireland]] {{·}} [[ImageShack]] {{·}} [[Imgur]] {{·}} [[Instagr.am]] {{·}} [[Panoramio]] {{·}} [[Photobucket]] {{·}} [[Picasa]] {{·}} [[Picplz ]] {{·}} [[Ptch]] {{·}} [[puu.sh]] {{·}} [[Snapjoy]] {{·}} [[TwitPic]] {{·}} [[Wikimedia Commons]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Knowledge/[[Wikis]]''' || colspan=2 | [[arXiv]] {{·}} [[Citizendium]] {{·}} [[Edit.This]] {{·}} [[Encyclopedia Dramatica]] {{·}} [[Everything2]] {{·}} [[infoAnarchy]] {{·}} [[GeoNames]] {{·}} [[GNUPedia]] {{·}} [[Google Books]] {{·}} [[Insurgency Wiki]] {{·}} [[Knol]] {{·}} [[Nupedia]] {{·}} [[OpenCourseWare]] {{·}} [[OpenStreetMap]] {{·}} [[Project Gutenberg]] {{·}} [[Puella Magi]] {{·}} [[Referata]] {{·}} [[SongMeanings]] {{·}} [[ShoutWiki]] {{·}} [[The Internet Movie Database]] {{·}} [[The Pirate Bay]] {{·}} [[TropicalWikis]] {{·}} [[Urban Dictionary]] {{·}} [[Webmonkey]] {{·}} [[Wikia]] {{·}} [[Wikidot]] {{·}} [[WikiHow]] {{·}} [[Wikkii]] {{·}} [[WikiLeaks]] {{·}} [[Wikipedia]] {{·}} [[Wikispaces]] {{·}} [[Wik.is]] {{·}} [[Wiki-Site]] {{·}} [[WikiTravel]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Microblogging]]''' || colspan=2 | [[Identi.ca]] {{·}} [[Jaiku]] {{·}} [[Plurk]] {{·}} [[Sina Weibo]] {{·}} [[Tumblr]] {{·}} [[Twitter]] {{·}} [[TwitLonger]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Music/Audio''' || colspan=2 | [[Audimated.com]] {{·}} [[digCCmixter]] {{·}} [[Dogmazic.net]] {{·}} [[Free Music Archive]] {{·}} [[Gogoyoko]] {{·}} [[Indaba Music]] {{·}} [[Jamendo]] {{·}} [[Last.fm]] {{·}} [[MOG]] {{·}} [[PureVolume]] {{·}} [[Reverbnation]] {{·}} [[ShareTheMusic]] {{·}} [[SoundCloud]] {{·}} [[Soundpedia]] {{·}} [[Twaud.io]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''People''' || colspan=2 | [[Michael S. Hart]] {{·}} [[Steve Jobs]] {{·}} [[Mark Pilgrim]] {{·}} [[Dennis Ritchie]] {{·}} [[Len Sassaman Project]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Q&A''' || colspan=2 | [[Askville]] {{·}} [[Answerbag]] {{·}} [[Answers.com]] {{·}} [[Ask.com]] {{·}} [[Askalo]] {{·}} [[Baidu Knows]] {{·}} [[Blurtit]] {{·}} [[ChaCha]] {{·}} [[Experts Exchange]] {{·}} [[GirlsAskGuys]] {{·}} [[Google Answers]] {{·}} [[Google Questions and Answers]] {{·}} [[JustAnswer]] {{·}} [[MetaFilter]] {{·}} [[Quora]] {{·}} [[StackExchange]] {{·}} [[The AnswerBank]] {{·}} [[The Internet Oracle]] {{·}} [[Uclue]] {{·}} [[WikiAnswers]] {{·}} [[Yahoo! Answers]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Social bookmarking''' || colspan=2 | [[Addinto]] {{·}} [[Backflip]] {{·}} [[Balatarin]] {{·}} [[BibSonomy]] {{·}} [[Bkmrx]] {{·}} [[Blinklist]] {{·}} [[BlogMarks]] {{·}} [[BookmarkSync]] {{·}} [[CiteULike]] {{·}} [[Connotea]] {{·}} [[Delicious]] {{·}} [[Digg]] {{·}} [[Diigo]] {{·}} [[Dir.eccion.es]] {{·}} [[Evernote]] {{·}} [[Excite Bookmark]] {{·}} [[Faves]] {{·}} [[Favilous]] {{·}} [[folkd]] {{·}} [[Freelish]] {{·}} [[Getboo]] {{·}} [[GiveALink.org]] {{·}} [[Gnolia]] {{·}} [[Google Bookmarks]] {{·}} [[HeyStaks]] {{·}} [[IndianPad]] {{·}} [[Kippt]] {{·}} [[Knowledge Plaza]] {{·}} [[Licorize]] {{·}} [[Linkwad]] {{·}} [[Menéame]] {{·}} [[Microsoft Developer Network]] {{·}} [[Microsoft TechNet]] {{·}} [[Mister Wong]] {{·}} [[My Web]] {{·}} [[Mylink Vault]] {{·}} [[Newsvine]] {{·}} [[Oneview]] {{·}} [[Pearltrees]] {{·}} [[Pinboard]] {{·}} [[Pocket]] {{·}} [[Reddit]] {{·}} [[sabros.us]] {{·}} [[Scloog]] {{·}} [[Scuttle]] {{·}} [[Simpy]] {{·}} [[SiteBar]] {{·}} [[Squidoo]] {{·}} [[StumbleUpon]] {{·}} [[Twine]] {{·}} [[Vizited]] {{·}} [[Yummymarks]] {{·}} [[Xmarks]] {{·}} [[Zootool]] {{·}} [[Zotero]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Social networks''' || colspan=2 | [[Bebo]] {{·}} [[BlackPlanet]] {{·}} [[Classmates.com]] {{·}} [[Cyworld]] {{·}} [[deviantART]] {{·}} [[Dopplr]] {{·}} [[douban]] {{·}} [[Facebook]] {{·}} [[Flixster]] {{·}} [[Friendster]] {{·}} [[Gaia Online]] {{·}} [[Google+]] {{·}} [[Habbo]] {{·}} [[hi5]] {{·}} [[Hyves]] {{·}} [[LinkedIn]] {{·}} [[mixi]] {{·}} [[MyHeritage]] {{·}} [[MyLife]] {{·}} [[Myspace]] {{·}} [[Netlog]] {{·}} [[Odnoklassniki]] {{·}} [[Orkut]] {{·}} [[Plaxo]] {{·}} [[Qzone]] {{·}} [[Renren]] {{·}} [[Skyrock]] {{·}} [[Sonico.com]] {{·}} [[Tagged]] {{·}} [[Viadeo]] {{·}} [[Vkontakte]] {{·}} [[WeeWorld]] {{·}} [[Wretch]] {{·}} [[Social network|more sites...]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Software''' || colspan=2 | [[Android Development]] {{·}} [[Alioth]] {{·}} [[Assembla]] {{·}} [[BerliOS]] {{·}} [[Betavine]] {{·}} [[Bitbucket]] {{·}} [[BountySource]] {{·}} [[CodePlex]] {{·}} [[Freepository]] {{·}} [[Free Software Foundation]] {{·}} [[GNU Savannah]] {{·}} [[GitHub]] {{·}} [[Gitorious]] {{·}} [[Gna!]] {{·}} [[Google Code]] {{·}} [[java.net]] {{·}} [[JavaForge]] {{·}} [[KnowledgeForge]] {{·}} [[Launchpad]] {{·}} [[LuaForge]] {{·}} [[mozdev]] {{·}} [[OSOR.eu]] {{·}} [[OW2 Consortium]] {{·}} [[Openmoko]] {{·}} [[Ourproject.org]] {{·}} [[Project Kenai]] {{·}} [[RubyForge]] {{·}} [[SEUL.org]] {{·}} [[SourceForge]] {{·}} [[tigris.org]] {{·}} [[Transifex]] {{·}} [[TuxFamily]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''[[Video hosting services]]''' || colspan=2 | [[Academic Earth]] {{·}} [[Blip.tv]] {{·}} [[Google Video]] {{·}} [[Justin.tv]] {{·}} [[TED Talks]] {{·}} [[Ustream]] {{·}} [[Viddler]] {{·}} [[Vimeo]] {{·}} [[Yahoo! Video]] {{·}} [[YouTube]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Other''' || colspan=2 | [[4chan]] {{·}} [[April Fools' Day]] {{·}} [[Amplicate]] {{·}} [[Circavie]] {{·}} [[Co.mments]] {{·}} [[Dmoz]] {{·}} [[Electronic Frontier Foundation]] {{·}} [[Feedly]] {{·}} [[Ficlets]] {{·}} [[FriendFeed]] {{·}} [[Gopher]] {{·}} [[Google Books Ngram]] {{·}} [[Google Reader]] {{·}} [[IFTTT]] {{·}} [[isoHunt]] {{·}} [[MegaUpload]] {{·}} [[MyBlogLog]] {{·}} [[Pastebin]] {{·}} [[Propeller.com]] {{·}} [[Quantcast]] {{·}} [[Salon Table Talk]] {{·}} [[SOPA blackout pages]] {{·}} [[World Wide Web]] {{·}} [[Yahoo! Buzz]] {{·}} [[Yahoo! Groups]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''Teams''' || colspan=2 | [[Bibliotheca Anonoma]] {{·}} [[LibreTeam]] {{·}} [[URLTeam]] {{·}} [[Yahoo Video Warroom]] {{·}} [[WikiTeam]]<br />
|-<br />
| align=center width=150px style="background: #ddddff;" | '''About [[Archive Team]]''' || colspan=2 | [[Introduction]] {{·}} [[Philosophy]] {{·}} [[Who We Are]] {{·}} [[Why Back Up?]] {{·}} [[Software]] {{·}} [[Films and documentaries about archiving]] {{·}} [[Formats]] {{·}} [[Cheap storage]] {{·}} [[Storage Media]] {{·}} [[Recommended Reading]] {{·}} [[Frequently Asked Questions|FAQ]]<br />
|}<br />
</center>[[Category:Archive Team]]<noinclude>[[Category:Templates]]</noinclude></div>Yipdwhttps://wiki.archiveteam.org/index.php?title=Justin.tv&diff=19027Justin.tv2014-06-13T01:30:41Z<p>Yipdw: </p>
<hr />
<div>{{Infobox project<br />
| title = Justin.tv<br />
| image = justintv_homepage_screenshot.png<br />
| logo = justintv_logo.png<br />
| URL = http://justin.tv<br />
| project_status = '''Special Case''' (archives to be deleted)<br />
| archiving_status = {{inprogress}}<br />
| irc = justouttv<br />
| tracker = [http://tracker.archiveteam.org/justintv/ justintv]<br />
| source = [https://github.com/ArchiveTeam/justintv-grab justintv-grab]<br />
}}<br />
<br />
'''Justin.tv''' is a live video streaming service.<br />
<br />
== Shutdown ==<br />
<br />
<blockquote><br />
<p>Changes to the Video Archive System</p><br />
<br />
<p>Dylan Reichstadt<br />
posted this on May 29 13:16</p><br />
<p>Over the last few months, our staff has been reviewing data surrounding our archive and VOD (Video on Demand) system. We found that more than half of our VODs are unwatched (with 0 or 1 total views), while the vast majority are rarely watched (with 10 or less views). This data was essential in better understanding how our service is being used. Even when adding the direct upload to YouTube functionality, we found this feature was seldom used. It’s quite clear: JTV is a home for live broadcasts. Viewers come to Justin.tv because they want to consume content and interact with their communities in real-time.</p><br />
<br />
<p>So, taking into consideration the above findings and countless discussions, we have concluded to remove all archiving after June 8, 2014. This means that live broadcasts will no longer be recorded.</p><br />
<br />
<p>We understand that archiving can be a very essential element for our broadcasters. We also understand that there are some community members who enjoy catching up on a past broadcast they missed. However, as a live video website, we want to put our focus on our live video delivery system, as this has received the most usage.<ref>https://help.justin.tv/entries/41803380-Changes-to-Video-Archive-System</ref></p><br />
<br />
</blockquote><br />
<br />
== Site structure ==<br />
<br />
* http://www.justin.tv/p/rest_api<br />
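<br />
The page above documents the site's REST endpoints. As a quick illustration of the kind of request involved, the sketch below asks for a channel's archived videos as JSON; the <code>/channel/archives</code> endpoint path, its parameters, and the <code>CHANNELNAME</code> placeholder are assumptions recalled from that documentation, so treat the linked API page as authoritative.<br />
<br />
<pre><br />
# Hypothetical example: list a channel's archived videos as JSON.<br />
# The endpoint path and the limit/offset parameters are assumptions; consult the REST API page.<br />
curl "http://api.justin.tv/api/channel/archives/CHANNELNAME.json?limit=100&offset=0"<br />
</pre><br />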
<br />
== How can I help? ==<br />
<br />
Start up a [[Warrior]]. Justin.tv is the default project, so the Warrior will begin working on it automatically.<br />
<br />
Alternatively, the scripts can be run manually; the authoritative instructions are in the source code repository, and a typical invocation is sketched below.<br />
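<br />
The following is a minimal sketch of the usual ArchiveTeam manual-run workflow, assuming a Unix-like system with git and Python available; the exact dependencies and flags are whatever the justintv-grab README specifies, so the commands below are illustrative rather than authoritative.<br />
<br />
<pre><br />
# Sketch of a typical manual run (the justintv-grab README has the real steps).<br />
git clone https://github.com/ArchiveTeam/justintv-grab    # fetch the grab scripts<br />
cd justintv-grab<br />
pip install seesaw                                        # ArchiveTeam's pipeline runner<br />
run-pipeline pipeline.py --concurrent 2 YOURNICKNAME      # YOURNICKNAME appears on the tracker<br />
</pre><br />
<br />
Starting with a low <code>--concurrent</code> value and raising it once items are completing and uploading cleanly is the usual practice.<br />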
<br />
Finally, please join us on [[IRC]] at #justouttv to report errors, talk about the project, etc.<br />
<br />
== Archives ==<br />
<br />
Archives in [[WARC]] format are uploaded to the Internet Archive [https://archive.org/details/justintv justintv] collection. The collection contains roughly 117,600 videos (about 9,900 GB).<br />
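<br />
To enumerate or mirror parts of the collection, the Internet Archive's command-line client can be used. This is only a sketch, assuming the <code>internetarchive</code> Python package is installed; the identifier shown is a placeholder.<br />
<br />
<pre><br />
pip install internetarchive                  # provides the `ia` command<br />
ia search 'collection:justintv' --itemlist   # print the identifiers of items in the collection<br />
ia download IDENTIFIER                       # download one item's WARC files (pick from the list above)<br />
</pre><br />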
<br />
Because of the immense size of the Justin.tv video archives, about 1 PB in total, not all videos could be stored in the Internet Archive. Unfortunately, there is no practical way to determine which videos are "important".<br />
<br />
Videos with 10 or more views were selected for archiving by the Warrior; the lists of videos were compiled by scraping the site's search function and from a CSV manifest provided by Justin.tv staff. Some videos resided on lost or mismanaged Justin.tv servers (media6.justin.tv, media18.justin.tv, media27.justin.tv), and staff were unable to recover them.<br />
<br />
== External links ==<br />
<br />
* [http://mashable.com/2014/05/31/justin-tv-delete-video-archives/ "Justin.tv to Delete All Video Archives"]<br />
* [https://twitter.com/textfiles/status/476064989879349248 @textfiles: Huge congratulations to the volunteers of @archiveteam (...)]<br />
* [http://gigaom.com/2014/06/09/volunteer-archivists-save-big-parts-of-the-justin-tv-archive/ "Volunteer archivists save big parts of the Justin.tv archive"]<br />
<br />
== References ==<br />
<br />
<references/><br />
<br />
{{navigation box}}<br />
<br />
[[Category:Video hosting services]]</div>Yipdw