Difference between revisions of "Valhalla"

From Archiveteam
Revision as of 06:53, 19 September 2014

This wiki page is a collection of ideas for Project Valhalla.

This project/discussion has come about because there is a class of data that arrives, several times a year, as a massive amount of data with "large, but nominal" status within the Internet Archive. The largest example is currently MobileMe, which occupies hundreds of terabytes in the Internet Archive system (and is in need of WARC conversion), a cost far outstripping its use. Another is TwitPic, which is currently available (and might continue to be available) but which has shown itself to be a bad actor with regard to the longevity and predictability of its sunset.

Therefore, there is an argument that there could be a "third place" where data collected by Archive Team could sit until the Internet Archive (or another entity) grows its coffers/storage enough that 80-100 TB is "no big deal", just as 1 TB of data was annoying in 2009 and is now totally understandable for the value, e.g. Geocities.

This page covers storage options for data generated by Archive Team, primarily for the short term (say five years or less) but potentially also for the long term.

  • What options are out there, generally?
  • What are the costs, roughly?
  • What are the positives and negatives?

There has been a lot of study in this area over the years, of course, so links to known authorities and debates will be welcome as well.

Join the discussion in #huntinggrounds.

What does the Internet Archive do for this situation, anyway?

This section has not been cleared by the Internet Archive, and so should be considered a rough sketch.

The Internet Archive primarily wants "access" to the data it stores, so the primary storage methodology is spinning hard drives connected to a high-speed connection from multiple locations. These hard drives are between 4 and 6 TB each (as of 2014) and are of general grade, as is most of the hardware. The theory is that replacing cheap hardware is better than spending a lot of money on super-grade hardware (whatever that may be) and not being able to make the dollars stretch. Hundreds of drives die in a month, and the resiliency of the system allows replacements to be hot-swapped in for all of them.

There are multiple warehouses for storing the original books that are scanned, as well as materials like CD-ROMs and even hard drives. There are collections of tapes and CD-ROMs from previous iterations of storage, although they are thought of as drop-dead options instead of long-term archival storage - the preference is, first and foremost, the spinning hard drives.

The Archive does not generally use tape technology, having run into the classic "whoops, no tape drive on earth reads these any more" and "whoops, this tape no longer works properly".

The Archive has indicated that if Archive Team uses a physical storage method, such as tapes, paper, hard drives or anything else, it is willing to store these materials "as long as they are exceedingly labelled".

Options

| Storage type | Cost ($/TB/year) | Storage density (m³/TB) | Theoretical lifespan | Practical, tested lifespan | Notes |
|---|---|---|---|---|---|
| Hard drives[1] | | | | | These would have to be live. HDDs decay quickly, and if they're not spinning, you can't detect failures. Possible software for this kind of thing: syncthing, Tahoe-LAFS, ...? |
| Commercial / archival-grade tapes | | | | | |
| Consumer tape systems (VHS, Betamax, cassette tapes, ...) | | | | | |
| Vinyl | | | | | |
| PaperBack | | | | | |
| Optar | | | | | |
| Blu-Ray | $40 | | 30 years[2] | | Lasts a LOT longer than CD/DVD, but should not be assumed to last more than a decade. Raidz3 with Blu-rays: doing a backup in groups of 15 disks. Comes to under $.04/GB which is cheap, and low initial investment (drives) too! |
| M-DISC | | | | | Unproven technology, but potentially interesting. |
| Flash media | | | | | Wears out quickly, not-so-good long term storage. Soliciting donations for old flash media from people, or sponsorship from flash companies? |
| Glass/metal etching | | | | | |
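
The hard drive note above points out that failures cannot be detected on media that is not being read. One common way to detect silent corruption on any medium that can be read back is a checksum manifest: record a digest per file when writing, then recompute and compare later. The following is a minimal sketch in Python, assuming SHA-256 and a plain directory tree; the function names are hypothetical and are not an existing Archive Team tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root, '/'-separated) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose digest changed."""
    root = Path(root)
    damaged = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel_path)
    return damaged
```

A manifest built with `build_manifest` at write time would be stored alongside (and separately from) the media, and `verify_manifest` run on each periodic read-back; an empty list means every file still matches.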

Non-options

  • Ink-based Consumer Optical Media (CDs, DVD, etc.)
    • Differences between Blu-Ray and DVD? DVDs do not last very long. The fact is, the history of optical writable media has been one of chicanery, failure, and overpromising while under-delivering. Some DVDs failed within a year. There are claims Blu-Ray is different, but fool me 3,504 times, shame on me.
  • BitTorrent Sync
    • Proprietary (currently), so not a good idea to use as an archival format/platform
  • Amazon Glacier
    • Amazon Glacier seems like a great idea, until you realize they mean 1 cent per gigabyte per month. This is $120 per terabyte per year. The transfer out of 100 TB would also run over $10,000 in the month it's pulled from the system through a standard "send it elsewhere" approach, although there are notably cheaper options for "mail me a hard drive" and so on. Still, since the plan would be for everything going in to eventually come out, it's not a great option.
  • Floppies
    • "Because 1.4 trillion floppies exists less than 700 billion floppies. HYPOTHETICALLY, if you set twenty stacks side by side, figure a quarter centimeter per floppy thickness, excluded the size of the drive needed to read the floppies you would still need a structure 175,000 ft. high to house them. Let's also assume that the failure rate for floppies is about 5% (everyone knows that varies by brand, usage, time of manufacture, materials used, etc, but lets say 5% per year). 70 million of those 1.4 trillion floppies are unusuable. Figuring 1.4 MB per floppy disk, you are losing approximately 100MB of porn each year. Assuming it takes 5 seconds to replace a bad floppy, you would have to spend 97,222 hrs/yr to replace them. Considering there are only 8,760 hrs per year, you would require a staff of 12 people replacing floppies around the clock or 24 people on 12 hr shifts. Figuring $7/hr you would spend $367,920 on labor alone. Figuring a nickel per bad floppy, you would need $3,500,000 annually in floppy disks, bringing your 1TB floppy raid operating costs (excluding electricity, etc) to $3,867, 920 and a whole landfill of corrupted porn. Thank you for destroying the planet and bankrupting a small country with your floppy based porn RAID." (source)
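
The Amazon Glacier figures in the list above can be checked with a few lines of arithmetic. This is a rough sketch in Python using only the numbers quoted on this page (1 cent per GB per month, 100 TB, a five-year horizon, $10,000 to transfer out); it assumes decimal units (1 TB = 1000 GB), and the function names are made up for illustration. It is not a reproduction of Amazon's actual price list.

```python
GB_PER_TB = 1000  # assumption: decimal units, which is what "$120/TB/year" implies


def glacier_storage_cost(terabytes, years, usd_per_gb_month=0.01):
    """Storage-only cost in USD at a flat per-GB monthly rate."""
    return terabytes * GB_PER_TB * usd_per_gb_month * 12 * years


def implied_egress_rate(total_usd, terabytes):
    """Per-GB transfer-out price implied by a quoted total for a given volume."""
    return total_usd / (terabytes * GB_PER_TB)


# One terabyte for one year, as quoted in the text.
print(f"1 TB for 1 year:    ${glacier_storage_cost(1, 1):,.2f}")
# A 100 TB collection parked for the five-year horizon this page discusses.
print(f"100 TB for 5 years: ${glacier_storage_cost(100, 5):,.2f}")
# "Over $10,000" to pull 100 TB back out implies at least this much per GB.
print(f"Implied egress:     ${implied_egress_rate(10_000, 100):.2f}/GB")
```

Under these assumptions, storage alone for 100 TB over five years comes to $60,000 before any transfer-out charge, which is why the page lists Glacier as a non-option.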

From IRC

<Drevkevac> we are looking to store 100TB+ of media offline for 25+ years
<Drevkevac> if anyone wants to drop in, I will pastebin the chat log
<rat> DVDR and BR-R are not high volume. When you have massive amounts of data, raid arrays have too many points of failure.
<rat> Drevkevac: I work in a tv studio. We have 30+ years worth of tapes. And all of them are still good.
<rat> find a hard drive from 30 years ago and see how well it hooks up ;)
<brousch_> 1500 Taiyo Yuden Gold CD-Rs http://www.mediasupply.com/taiyo-yuden-gold-cd-rs.html
<Drevkevac> still, if its true, you could do, perhaps, raidz3s in groups of 15 disks or so?
<SketchCow> Please add paperbak to the wiki page.
<SketchCow> Fuck Optical Media. not an option;.
<Drevkevac> that would give you ~300GB per disk group, with 3 disks
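
The "~300GB per disk group" figure in the log can be reproduced with a short calculation. This sketch in Python assumes that the "3 disks" are the parity discs of a raidz3-style group of 15 (so any three discs in a group can be lost) and that each disc is a 25 GB single-layer Blu-ray; neither assumption is stated outright in the log, and the function names are hypothetical.

```python
import math


def group_usable_gb(discs_per_group=15, parity_discs=3, disc_gb=25):
    """Usable capacity of one group: only the non-parity discs hold data."""
    return (discs_per_group - parity_discs) * disc_gb


def discs_needed(total_gb, discs_per_group=15, parity_discs=3, disc_gb=25):
    """Return (groups, discs) required to hold total_gb of data."""
    usable = group_usable_gb(discs_per_group, parity_discs, disc_gb)
    groups = math.ceil(total_gb / usable)
    return groups, groups * discs_per_group


print(group_usable_gb())      # 12 data discs x 25 GB = 300
print(discs_needed(100_000))  # 100 TB in decimal GB -> (334, 5010)
```

At the 100 TB scale mentioned at the top of the log, this works out to 334 groups, or about 5,000 discs to burn, label and store.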

Where are you going to put it?

Okay, so you have the tech. Now you need a place for it to live.

Possibilities:

  • The Internet Archive Physical Warehouse, Richmond, CA
    • The Internet Archive has several physical storage facilities, including warehouses in Richmond, CA (home of the Physical Archive) and the main location in San Francisco, CA. They have indicated they are willing to take copies of Archive Team-sponsored physical materials with the intent of them being ingested into the Archive at large over time, as costs lower and 100 TB collections are not as big a drain (or a rash of funding arrives elsewhere).
  • Living Computer Museum, Seattle, WA
    • In discussions with Jason Scott, the Living Computer Museum has indicated they will have physical storage available for computer historical materials. Depending on the items being saved by Archive Team, they may be willing to host/hold copies for the foreseeable future.
  • Library of Congress, Washington, DC
    • The Library of Congress may be willing to take a donation of physical storage, although it is not indicated what they may do long-term with it.

Multiple copies would of course be great.

Project-specific suggestions

Twitch.tv (and other video services)

  • Keep the original video files in (semi-)offline storage, and store transcoded (compressed) versions on the Internet Archive.

See Also

References

  1. The Internet Archive's cost per TB, with 24/7 online hard drives, is approximately $2000 for forever.
  2. On the basis of the described studies and assuming adequate consideration of the specified conditions for storage and handling, as well as verification of data after writing, we estimate the Imation CD, DVD or Blu-ray media to have a theoretical readability of up to 30 years. The primary caveat is how you handle and store the media. http://support.tdkperformance.com/app/answers/detail/a_id/1685/~/life-expectancy-of-optical-media