Valhalla


This wiki page is a collection of ideas for Project Valhalla.

This project/discussion has come about because, several times a year, a class of data ends up sitting in the Internet Archive as a massive collection with "large, but nominal" status. The largest example is currently MobileMe, which occupies hundreds of terabytes in the Internet Archive system (and is in need of WARC conversion), a cost that far outstrips its current use. Another is TwitPic, which is currently available (and might continue to be available) but which has shown itself to be a bad actor with regard to the longevity and predictability of its sunset.

Therefore, there is an argument that there could be a "third place" where data collected by Archive Team could sit until the Internet Archive (or another entity) grows its coffers/storage enough that 80-100TB is "no big deal", just as 1TB of data was annoying in 2009 and is now considered totally worth it for the value, e.g. Geocities.

This is for short-term (or potentially also long-term) storage options, say five years or less, of data generated by Archive Team.

  • What options are out there, generally?
  • What are the costs, roughly?
  • What are the positives and negatives?

There has been a lot of study in this area over the years, of course, so links to known authorities and debates will be welcome as well.

Join the discussion in #huntinggrounds.

Goals

We want to:

  • Dump an unlimited[1] amount of data into something.
  • Recover that data at any point.

We do not care about:

  • Immediate or continuous availability.

We absolutely require:

  • Low (ideally, zero) human time for maintenance. If we have substantial human maintenance needs, we're probably going to need a Committee of Elders or something.
  • Data integrity. The storage medium must either be extremely durable, or be inexpensive and easy to copy and verify onto a fresh medium.

It would be nice to have:

  • No special environmental requirements that could not be handled by a third party. (So nobody in Archive Team would have to set up some sort of climate-controlled data-cave; however, if this is already something that e.g. IA does and they are willing to lease space, that's cool.)

What does the Internet Archive do for this Situation, Anyway?

This section has not been cleared by the Internet Archive, and so should be considered a rough sketch.

The Internet Archive primarily wants "access" to the data it stores, so the primary storage methodology is spinning hard drives connected to a high-speed connection from multiple locations. These hard drives are between 4 and 6 TB (as of 2014) and are of commodity grade, as is most of the hardware - the theory is that replacing cheap hardware is better than spending a lot of money on super-grade hardware (whatever that may be) and not being able to make the dollars stretch. Hundreds of drives die in a month, and the resiliency of the system allows replacements to be hot-swapped in.

There are multiple warehouses for storing the original books that are scanned, as well as materials like CD-ROMs and even hard drives. There are collections of tapes and CD-ROMs from previous iterations of storage, although they are thought of as drop-dead options instead of long-term archival storage - the preference is, first and foremost, the spinning hard drives.

The Archive does not generally use tape technology, having run into the classic problems of "whoops, no tape drive on earth reads these any more" and "whoops, this tape no longer works properly".

The Archive has indicated that if Archive Team uses a physical storage method, such as tapes, paper, hard drives or anything else, they are willing to store these materials "as long as they are exceedingly labelled".

Physical Options

For each storage type: rough cost ($/TB/year), storage density (m³/TB), theoretical and practical (tested) lifespans, and notes. Many of these fields are still unfilled.

  • Hard drives (simple distributed pool): $150/TB/year (full cost of the best reasonable 1TB+ external HD). As of September 2014, the best reasonable 1TB+ external HD is a 4TB WD. 25+ pool members would each need one HD, plus a computer, plus software to distribute data across the entire pool.
  • Hard drives (dedicated distributed pool): an off-the-shelf or otherwise specified, dedicated network storage device used exclusively as part of a distributed pool.
  • Hard drives (SPOF)[2]: $62/TB/year, but you have to buy 180TB. For a single location to provide all storage needs, building a Backblaze Storage Pod 4.0 runs an average of $11,000 and provides 180TB of non-redundant, not-highly-available storage. (You really want more than one pod mirroring your data, but this is the most effective way to get that much storage in one place.)
  • Commercial / archival-grade tapes: (no details yet)
  • Consumer tape systems (VHS, Betamax, cassette tapes, ...): (no details yet)
  • Vinyl: (no details yet)
  • PaperBack: at 500KB per letter sheet, 1TB is 2,199,024 sheets, or ~4,400 reams (500 sheets each), or an 8'x16' room filled with 6'-tall stacks. It would take 63.6 days of continuous printing.[3] (The arithmetic is rechecked in the sketch after this table.)
  • Optar: at 200KB per page, this has less than half the storage density of PaperBack.
  • Blu-Ray: $40/TB (50-pack spindle of 25GB BD-Rs); theoretical lifespan 30 years.[4] Lasts a LOT longer than CD/DVD, but should not be assumed to last more than a decade. One proposal: RAIDZ3-style backups in groups of 15 discs, which comes to under $0.04/GB - cheap, and a low initial investment (drives) too. Specifically, a 50-pack spindle of 25GB BD-Rs could readily hold 1TB of data for $30-50 per spindle; 50GB and 100GB discs are more expensive per GB.
  • M-DISC: unproven technology, but potentially interesting.
  • Flash media: very durable for online use, and usually fails from lots of writes, so a drive might never wear out in cold-storage use. Newer drives can have 10-year warranties, but the cells may leak charge over time; JEDEC JESD218A only specifies 101 weeks (almost two years) of retention without power, so we'd have to check the specs of the specific drives, or power them up and rewrite the data to refresh it about once a year. Possibly solicit donations of old flash media, or sponsorship from flash companies?
  • Glass/metal etching: (no details yet)
  • Amazon Glacier: $122.88/TB/year (storage only; retrieval billed separately); average annual durability of 99.999999999%.[5] Retrieving 5% or less per month into S3 is free (5% of 100TB is 5TB), and data can be copied out from S3 to a SATA HD for $2.50/hr plus media handling and shipping fees. Downloading 5TB from S3 would cost $614.40 (~$122.88/TB), but only $44.82 to transfer to an HD via USB 3 or SATA (USB 2 is slower).
  • Dropbox for Business: $160*/TB/year ($795/year). Provides a shared pool of 1TB per user at $795/year (five-user minimum, 5TB), plus $125 per additional user per year.
  • Box.com for Business: $180*/TB/year ("unlimited" storage for $900/year). Provides "unlimited" storage at $15/user/month, five-user minimum, or $900/year.
  • Dedicated colocated storage servers: $100*/TB/year (e.g. $1,300 for one year of 12TB rackmount server rental). Rent storage servers from managed hosting/colocation providers and pool data across them. Benefits: bandwidth and electricity are included in the cost, and files could be made available online immediately. Negatives: someone needs to administer tens of servers.
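
As a quick sanity check on the arithmetic in the PaperBack and Blu-ray rows above, the following Python sketch simply recomputes the figures from the numbers already given in the table (500KB per sheet, a 24-page-per-minute printer, 25GB discs at roughly $40 per 50-pack); nothing here is new data.

  # Rough arithmetic behind the PaperBack and Blu-ray rows.
  TB = 1024 ** 4                            # bytes in one terabyte (binary)

  # PaperBack: ~500KB per letter sheet, HP LaserJet 5Si at 24 pages/minute.
  sheets = TB / (500 * 1000)                # ~2.2 million sheets
  reams = sheets / 500                      # ~4,400 reams
  print_days = sheets / 24 / 60 / 24        # ~63.6 days of continuous printing

  # Blu-ray: 25GB BD-Rs, 50-disc spindle at roughly $40.
  discs_per_tb = TB / (25 * 1000 ** 3)      # ~44 discs, so a 50-pack covers 1TB with spares
  cost_per_gb = 40 / (50 * 25)              # $0.032/GB, i.e. "under $.04/GB"

  print(f"PaperBack: {sheets:,.0f} sheets, {reams:,.0f} reams, {print_days:.1f} days of printing")
  print(f"Blu-ray: {discs_per_tb:.0f} discs/TB, ${cost_per_gb:.3f}/GB")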

Software Options

Some of the physical options require supporting software.

Removable media requires a centralized index of who has what discs, where they are, how they are labeled, and what the process for retrieval/distribution is. It could just be a wiki page, but it does require something.
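
As an illustration only (none of these field names are agreed on, and a wiki table would do just as well), a per-disc index entry might need to capture something like this:

  # Hypothetical index entry for one disc in a removable-media collection.
  # Field names and values are illustrative, not a proposed standard.
  disc_record = {
      "project": "twitpic",                 # which Archive Team project the data belongs to
      "disc_label": "twitpic-0012",         # label physically written on the disc
      "holder": "ExampleArchiver",          # who currently has the disc
      "location": "Austin, TX, USA",        # rough location, for shipping estimates
      "manifest_sha256": "(checksum of the disc's manifest file)",
      "burned_on": "2014-09-20",
      "last_verified": "2014-12-01",
      "retrieval": "email the holder; ships anywhere, no return expected",
  }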

A simple pool of HDs ("simple pool"), one without a shared filesystem, just people offering up HDs, requires software running on Windows, Linux and/or Mac hardware to allow Archive Team workers to learn who has free disk space, and to save content to those disks. This could be just an IRC conversation and SFTP, but the more centralized and automated, the more likely available disk space will be able to be utilized. Software that is not cross-platform cannot be used here.
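
As a very rough sketch of the "who has free disk space" half of that problem (the mount point is a placeholder, and in practice the result could just be pasted into IRC), a cross-platform report might look like:

  #!/usr/bin/env python3
  # Sketch: report how much free space a pool member has on their archive drive.
  # shutil.disk_usage works on Windows, Mac, and Linux.
  import json
  import shutil
  import socket

  ARCHIVE_PATH = "/mnt/archiveteam"  # hypothetical path to the volunteer's HD

  def free_space_report(path=ARCHIVE_PATH):
      usage = shutil.disk_usage(path)
      return {
          "host": socket.gethostname(),
          "path": path,
          "free_bytes": usage.free,
          "total_bytes": usage.total,
      }

  if __name__ == "__main__":
      # In a real pool this would be sent somewhere central; printing keeps it dependency-free.
      print(json.dumps(free_space_report(), indent=2))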

A simple distributed and redundant pool of HDs ("distributed pool") requires software running on Windows, Linux and Mac hardware to manage a global filesystem or object store, and distribute uploads across the entire pool of available space, and make multiple copies on an ongoing basis to ensure preservation of data if a pool member goes offline. This has to be automated and relatively maintenance-free, and ideally low-impact on CPU and memory if it will be running on personal machines with multi-TB USB drives hanging off them. Software that is not cross-platform cannot be used here.
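
To make the "multiple copies across the pool" requirement concrete, here is a minimal placement sketch under invented assumptions (three copies per item, members identified by name); real software such as Tahoe-LAFS or MogileFS does this with far more care, including repair when members disappear.

  # Sketch: pick N distinct pool members to hold a copy of each item,
  # preferring whoever has the most free space so the pool fills evenly.
  REPLICAS = 3  # arbitrary; enough to survive a member or two going offline

  def place(item_size, members, replicas=REPLICAS):
      """members maps member name -> free bytes; returns the chosen holders."""
      candidates = [m for m, free in members.items() if free >= item_size]
      if len(candidates) < replicas:
          raise RuntimeError("not enough pool members with free space")
      candidates.sort(key=lambda m: members[m], reverse=True)
      chosen = candidates[:replicas]
      for m in chosen:
          members[m] -= item_size  # account for the new copy
      return chosen

  if __name__ == "__main__":
      pool = {"alice": 4e12, "bob": 2e12, "carol": 3e12, "dave": 1e12}
      print(place(25e9, pool))  # one 25GB chunk -> ['alice', 'carol', 'bob']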

A dedicated distributed and redundant pool of HDs ("dedicated pool") requires a selection of dedicated hardware and disks for maximum availability, and software to run on that hardware to manage a global filesystem or object store. It has to be automated and relatively maintenance-free, but would be the only thing running on its dedicated hardware, and as such does not have to be cross-platform.

For each option: filesystem or object store, platform(s), license, which pool it would suit (simple, distributed, dedicated), plus pros, cons and notes.

  • Tahoe-LAFS: filesystem; Windows, Mac, Linux; GPL 2+; distributed or dedicated pool. Pros: uses what people already have, can spread expenses out, could be a software-only solution. Cons: the barrier to leaving is non-existent, which might cause data loss even with auto-fixing infrastructure, and it's too slow to be a primary offloading site.[6] Accounting is experimental, meaning "in practice ... anybody running a storage node can also automatically shove shit onto it, with no way to track down who uploaded how much or where or what it is" (joepie91 on IRC).
  • Ceph: object store and filesystem; Linux; LGPL; dedicated pool.
  • GlusterFS: filesystem; Mac, Linux, BSD, (Open)Solaris; GPL 3; dedicated pool.
  • Gfarm: filesystem; Mac, Linux, BSD, Solaris; X11; dedicated pool.
  • Quantcast: filesystem; Linux; Apache; dedicated pool. Like HDFS, intended for MapReduce processing, which writes large files and doesn't delete them; random access and erasing or moving data around may not be performant.
  • HDFS: filesystem; Java (cross-platform); Apache; distributed or dedicated pool. Like Quantcast, intended for MapReduce processing, which writes large files and doesn't delete them; random access and erasing or moving data around may not be performant.
  • XtreemFS: filesystem; Linux, Solaris; BSD license; dedicated pool.
  • MogileFS: object store; Linux; GPL; dedicated pool. Understands distributing files across multiple networks, not just multiple disks. As an object store, you can't just mount it as a disk and dump files onto it; you have to push them in through its API and retrieve them the same way.
  • Riak CS: object store; Mac, Linux, BSD; Apache; dedicated pool. S3 API compatible (see the sketch after this table). Multi-datacenter replication (which is roughly what having multiple disparate users on different networks amounts to) is only available in the commercial offering. A former Basho employee suggests this might not be a good fit due to the high latency and unstable connections we'd be dealing with; datacenter-to-datacenter sync is an "entirely different implementation" than local replication, and would require the enterprise offering.
  • MongoDB GridFS: object store; Windows, Mac, Linux; AGPL; distributed or dedicated pool.
  • LeoFS: object store; Mac, Linux; Apache; dedicated pool. S3-compatible interface (see the sketch after this table), beta NFS interface, supports multi-datacenter replication, designed with GUI administration in mind.
  • BitTorrent Sync: synchronization software; Windows, Mac, Linux, BSD, NAS; proprietary; simple pool. Commercially supported software. As straight synchronization software, it mirrors folders across devices: individual users would have to make synced folders available to get copies of archives, and then they would be mirrored, and that's it. Synchronization software in general is not the right solution for this problem.
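
Because Riak CS and LeoFS both expose an S3-compatible API, pushing archives into and out of either of them could look roughly like the boto3 sketch below. The endpoint URL, bucket name, and credentials are placeholders, and boto3 is just one of several S3 client libraries; this is a sketch of the object-store workflow, not a recommendation.

  # Sketch: use an S3-compatible object store (Riak CS, LeoFS, ...) via boto3.
  # Endpoint URL, credentials, and bucket name are all placeholders.
  import boto3

  s3 = boto3.client(
      "s3",
      endpoint_url="http://leofs-gateway.example.org:8080",
      aws_access_key_id="POOL_ACCESS_KEY",
      aws_secret_access_key="POOL_SECRET_KEY",
  )

  s3.create_bucket(Bucket="archiveteam-valhalla")
  # There is no filesystem to mount: everything goes in and out as objects.
  s3.upload_file("twitpic-0012.tar", "archiveteam-valhalla", "twitpic/twitpic-0012.tar")
  s3.download_file("archiveteam-valhalla", "twitpic/twitpic-0012.tar", "restored.tar")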

Non-options

  • Ink-based Consumer Optical Media (CDs, DVD, etc.)
    • What about the differences between Blu-Ray and DVD? DVDs do not last very long. The fact is, the history of writable optical media has been one of chicanery, failure, and overpromising while under-delivering. Some DVDs failed within a year. There are claims that Blu-Ray is different, but fool me 3,504 times, shame on me.
  • BitTorrent Sync
    • Proprietary (currently), so not a good idea to use as an archival format/platform
  • Amazon S3 / Google Cloud Storage / Microsoft Azure Storage
    • Amazon S3 might be a viable waypoint for short-term, intra-month storage ($30.68/TB), but retrieval over the internet, as with Glacier, is expensive: $8,499.08 for 100TB (the arithmetic is reconstructed in the sketch after this list). Google's and Microsoft's offerings are all in the same price range.
  • Floppies
    • "Because 1.4 trillion floppies exists less than 700 billion floppies. HYPOTHETICALLY, if you set twenty stacks side by side, figure a quarter centimeter per floppy thickness, excluded the size of the drive needed to read the floppies you would still need a structure 175,000 ft. high to house them. Let's also assume that the failure rate for floppies is about 5% (everyone knows that varies by brand, usage, time of manufacture, materials used, etc, but lets say 5% per year). 70 million of those 1.4 trillion floppies are unusuable. Figuring 1.4 MB per floppy disk, you are losing approximately 100MB of porn each year. Assuming it takes 5 seconds to replace a bad floppy, you would have to spend 97,222 hrs/yr to replace them. Considering there are only 8,760 hrs per year, you would require a staff of 12 people replacing floppies around the clock or 24 people on 12 hr shifts. Figuring $7/hr you would spend $367,920 on labor alone. Figuring a nickel per bad floppy, you would need $3,500,000 annually in floppy disks, bringing your 1TB floppy raid operating costs (excluding electricity, etc) to $3,867, 920 and a whole landfill of corrupted porn. Thank you for destroying the planet and bankrupting a small country with your floppy based porn RAID." (source)

From IRC

<Drevkevac> we are looking to store 100TB+ of media offline for 25+ years
<Drevkevac> if anyone wants to drop in, I will pastebin the chat log
<rat> DVDR and BR-R are not high volume. When you have massive amounts of data, raid arrays have too many points of failure.
<rat> Drevkevac: I work in a tv studio. We have 30+ years worth of tapes. And all of them are still good.
<rat> find a hard drive from 30 years ago and see how well it hooks up ;)
<brousch_> 1500 Taiyo Yuden Gold CD-Rs http://www.mediasupply.com/taiyo-yuden-gold-cd-rs.html
<Drevkevac> still, if its true, you could do, perhaps, raidz3s in groups of 15 disks or so?
<SketchCow> Please add paperbak to the wiki page.
<SketchCow> Fuck Optical Media. not an option;.
<Drevkevac> that would give you ~300GB per disk group, with 3 disks

Where are you going to put it?

Okay, so you have the tech. Now you need a place for it to live.

Possibilities:

  • The Internet Archive Physical Warehouse, Richmond, CA
    • The Internet Archive has several physical storage facilities, including warehouses in Richmond, CA (home of the Physical Archive) and the main location in San Francisco, CA. They have indicated they are willing to take copies of Archive Team-sponsored physical materials with the intent of them being ingested into the Archive at large over time, as costs lower and 100tb collections are not as big a drain (or a rash of funding arrives elsewhere).
  • Living Computer Museum, Seattle, WA
    • In discussions with Jason Scott, the Living Computer Museum has indicated they will have physical storage available for computer-historical materials. Depending on the items being saved by Archive Team, they may be willing to host/hold copies for the foreseeable future.
  • Library of Congress, Washington, DC
    • The Library of Congress may be willing to take a donation of physical storage, although it is not indicated what they may do long-term with it.

Multiple copies would of course be great.

No, seriously, how are you going to actually DO it

There are only a few practical hardware+software+process combinations. In order of cost to each volunteer:

  • A pool of volunteers with Blu-ray burners commit to ("the Blu-ray option"):
    • buying a 50-disc spindle of 25GB discs per TB per project,
    • burning them,
    • verifying them,
    • storing them somewhere climate-controlled (a shelf in a house with AC and heat is fine, an attic/garage/flooded basement is not),
    • verifying them regularly (monthly? quarterly?) and replacing discs if necessary, and
    • shipping them somewhere else upon request, with no expectation of return (permanent storage, consolidation, etc.).

This probably requires a minimum of three volunteers per TB per project. Probably best to pre-split the data into < 25GB chunks so each disc can be labeled the same and expected to have the same data on it. Fifty 25GB discs is a little more than a TB, and it's expected you'll lose a few to bad burns each time, but it might be worth buying more than a spindle and generating parity files onto additional discs.
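
A minimal sketch of that pre-splitting step, assuming nothing about the real tooling: cut the archive into chunks that fit on a 25GB disc with headroom, and write a SHA-256 manifest (in the same "digest  filename" format that sha256sum uses) to burn alongside the data and ship with every spindle. The chunk size and naming scheme are arbitrary choices.

  #!/usr/bin/env python3
  # Sketch: split a large archive into <25GB pieces and write a sha256 manifest.
  import hashlib
  import os
  import sys

  CHUNK = 24 * 1000 ** 3   # 24GB per piece, leaving headroom on a 25GB BD-R
  BLOCK = 64 * 1024 ** 2   # copy in 64MB blocks so a whole chunk never sits in RAM

  def split_with_manifest(path):
      manifest = []
      with open(path, "rb") as src:
          part = 0
          while True:
              digest, written = hashlib.sha256(), 0
              name = f"{path}.part{part:03d}"
              with open(name, "wb") as out:
                  while written < CHUNK:
                      block = src.read(min(BLOCK, CHUNK - written))
                      if not block:
                          break
                      out.write(block)
                      digest.update(block)
                      written += len(block)
              if written == 0:          # opened one file past the end of the input
                  os.remove(name)
                  break
              manifest.append((name, digest.hexdigest()))
              part += 1
      with open(path + ".manifest.sha256", "w") as mf:
          for name, hexdigest in manifest:
              mf.write(f"{hexdigest}  {name}\n")

  if __name__ == "__main__":
      split_with_manifest(sys.argv[1])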

  • A pool of volunteers commit to ("the simple pool"):
    • buying a best reasonable external HD,
    • downloading archives to it,
    • keeping it spun up, or spinning it up regularly (monthly? quarterly?) and running filesystem and content checks on it,
    • storing it somewhere climate-controlled (a shelf in a house with AC and heat is fine, an attic/garage/flooded basement is not),
    • buying additional HDs once it's full or if there are drive errors, and
    • shipping it somewhere else upon request, with no expectation of return (permanent storage, consolidation, etc.).

Same as with Blu-rays, and not really any more expensive (a $150 4TB HD costs about the same as four 1TB spindles of Blu-rays at $37.50 each), except look at all that disc-swapping time and effort you don't have to do. You don't have to split data into chunks, but you do want to download it in a resumable fashion and verify it afterwards: checksums, parity files, something. You also risk losing a lot more if a drive fails, and the cost per volunteer is higher (replacing a whole drive versus replacing individual discs or spindles). As such, you still probably want a minimum of three volunteers per TB per project (so a 2TB project needs six volunteers with 1TB each, not three volunteers holding all 2TB each).
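
The "verify it afterwards" step can reuse a manifest like the one sketched for the Blu-ray option; re-checking it periodically is just a matter of re-hashing, as in this rough sketch:

  #!/usr/bin/env python3
  # Sketch: re-verify downloaded files against a sha256 manifest
  # (same "digest  filename" format that sha256sum writes).
  import hashlib
  import sys

  def verify(manifest_path):
      bad = 0
      with open(manifest_path) as mf:
          for line in mf:
              digest, name = line.rstrip("\n").split("  ", 1)
              h = hashlib.sha256()
              with open(name, "rb") as f:
                  for block in iter(lambda: f.read(64 * 1024 ** 2), b""):
                      h.update(block)
              if h.hexdigest() != digest:
                  bad += 1
                  print(f"MISMATCH: {name}")
      print("all files verified" if bad == 0 else f"{bad} bad file(s)")

  if __name__ == "__main__":
      verify(sys.argv[1])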

  • A pool of volunteers commit to ("the distributed pool"):
    • all buying the same, standard, inexpensive, hackable, RAID 1, NAS,
      • WD My Cloud Mirror (starts at $300 for 2TB [called "4TB," only 2TB with mirroring])
      • QNAP (2-bay starts at $140 without HDs)
      • Synology (2-bay starts at $200 without HDs)
      • Pogoplug Series 4 + two best reasonable external HD + software RAID 1, or a download script that manually mirrors files ($20 without HDs)
    • keeping it spun up, online, and possibly accessible by external AT admins,
    • storing it somewhere climate-controlled (a shelf in a house with AC and heat is fine, an attic/garage/flooded basement is not),
    • buying entire additional units once they are full or if there are drive errors, and
    • shipping the drives (or the entire My Cloud Mirror unit, if that's the one selected) somewhere else upon request, with no expectation of return (permanent storage, consolidation, etc.).

These units provide dramatically improved reliability for content, enough that perhaps you only need two volunteers per project, and no need to split by TB, since each volunteer would have two copies. Having everyone buy the same hardware means reduced administration time overall, especially if custom scripts are involved. QNAP and Synology both have official SDKs, and all of them run some flavor of Linux, with Synology supporting SSH logins out of the box. The Pogoplug is the most underpowered of the options, but even it should be powerful enough to run a MogileFS storage node, or a script that downloads to one HD and copies to the other. (Checksums would be really slow, though.) This is moderately expensive per-volunteer, with an upfront cost of $320-$500.
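
The "downloads to one HD and copies to the other" idea mentioned for the Pogoplug could be as small as the sketch below; the mount points and source URL are placeholders, and checksumming is deliberately left out since (as noted) it would be slow on that hardware.

  #!/usr/bin/env python3
  # Sketch: fetch an archive onto the first drive, then mirror it to the second.
  import shutil
  import urllib.request
  from pathlib import Path

  PRIMARY = Path("/mnt/drive1/archiveteam")   # placeholder mount points
  MIRROR = Path("/mnt/drive2/archiveteam")

  def fetch_and_mirror(url, filename):
      PRIMARY.mkdir(parents=True, exist_ok=True)
      MIRROR.mkdir(parents=True, exist_ok=True)
      target = PRIMARY / filename
      urllib.request.urlretrieve(url, str(target))   # download onto the first drive
      shutil.copy2(target, MIRROR / filename)        # then copy to the second

  if __name__ == "__main__":
      fetch_and_mirror("http://example.org/twitpic-0012.tar", "twitpic-0012.tar")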

  • A pool of volunteers commit to ("the dedicated pool"):
    • all buying the same, standard, expensive NAS,
      • iXsystems FreeNAS Mini (starts at $1000 without HDs),
      • A DIY FreeNAS box ($300+ without HDs),
      • A DIY NexentaStor box (probably the same as the DIY FreeNAS box)
    • keeping it spun up, online, and possibly accessible by external AT admins,
    • storing it somewhere climate-controlled and well-ventilated (a shelf with no airflow is not fine),
    • replacing drives if there are drive errors,
    • migrating the pool to larger disks once it starts getting full, and
    • shipping the drives somewhere else upon request, with no expectation of return (permanent storage, consolidation, etc.).

A set of volunteers with (comparatively) expensive network-attached storage gives you a lot of storage in a lot of locations, potentially tens of redundant TB in each one, depending on the size of the chassis. You want everyone running the same NAS software, but the hardware can vary somewhat; however, the hardware should all have ECC RAM, and the more the better. MogileFS storage nodes are known to run on NexentaStor, and FreeNAS supports plugins, so it could be adapted to run there, or you could figure out e.g. LeoFS (which also expects ZFS). This is the most expensive option per-volunteer, upfront costs starting at around $1300 for a DIY box with four 4TB WD Red drives.

  • A pool of volunteers set up a recurring payment to fund ("the server option"):
    • one or more rented, managed, storage servers; or
    • saving up to buy one or more storage servers, and then hosting it somewhere.

A rented server has no hardware maintenance costs: replacing a failed HD is the responsibility of the hosting provider, in both materials and labor. This is not the case with a purchased, colocated server, where someone would have to buy a replacement drive and either bring it to the colocation center and swap it themselves, or ship it there and be billed for the labor involved in replacing it.

What Can You Contribute?

For each volunteer: name, what you can contribute, for how long, and your exit strategy.

  • ExampleArchiver
    • What you can contribute: describe what you are willing to buy/build/write/do. Talk about the connection you would use, the storage conditions, etc. How much money can you put into it?
    • For how long: for how long can you truly commit to this?
    • Exit strategy: if you need to quit or wind down your contribution, what are you willing to do? Can you guarantee a period of notice? Are you willing to ship your hardware or media to another volunteer anywhere in the world, or will you want to keep it?
  • dnova
    • What I can contribute:
      • Willing to burn and maintain a Blu-ray collection (can provide a burner and at least some discs).
      • Willing to write/maintain a tape library (but cannot provide a tape drive/tapes).
      • Willing to participate in the simple pool or storage pool, depending on technical details.
      • I can store media in a class 1000 cleanroom!
      • Willing to provide short-term storage of a few hundred GB of RAIDZ-1 storage on a 75/10 residential connection.
    • For how long: 2+ years in my current geographical location and with cleanroom access; willing to continue indefinitely wherever I go, but some details may change accordingly.
    • Exit strategy: can give ample notice for either full upload and/or shipping of all media/hardware anywhere in the world.
  • vitorio
    • What I can contribute:
      • Participating in the simple pool (I only have a laptop, so I'd store the HDs offline at home and check them monthly/quarterly).
      • Participating in the distributed pool (residential 30/10 connection).
      • Contributing $100/mo. for the server option.
    • For how long: indefinitely.
    • Exit strategy: can give ample notice for either full upload and/or shipping of all hardware anywhere in the world.

Project-specific suggestions

Twitch.tv (and other video services)

  • Keep the original video files in (semi-)offline storage, and store transcoded (compressed) versions on the Internet Archive (a rough transcode sketch follows).
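
A rough sketch of that split, assuming ffmpeg is available; the codec settings are illustrative, not a recommendation, and whoever runs a given project would pick the actual parameters.

  #!/usr/bin/env python3
  # Sketch: keep the original for (semi-)offline storage and produce a smaller
  # H.264/AAC access copy for upload to the Internet Archive. Requires ffmpeg.
  import subprocess
  import sys

  def transcode_for_access(original, access_copy):
      subprocess.run(
          ["ffmpeg", "-i", original,
           "-c:v", "libx264", "-crf", "28", "-preset", "slow",
           "-c:a", "aac", "-b:a", "128k",
           access_copy],
          check=True,
      )

  if __name__ == "__main__":
      # The original then goes to cold storage; the access copy goes online.
      transcode_for_access(sys.argv[1], sys.argv[2])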

See Also

References

  1. Unlimited doesn't mean infinite, but it does mean that we shouldn't worry about running out of space. We won't be the only expanding data store.
  2. The Internet Archive's cost per TB, with 24/7 online hard drives, is approximately $2000 for forever.
  3. An HP LaserJet 5Si prints 24 pages per minute; at 500K bytes per page, that is approximately 200,000 bytes per second.
  4. On the basis of the described studies and assuming adequate consideration of the specified conditions for storage and handling, as well as verification of data after writing, we estimate the Imation CD, DVD or Blu-ray media to have a theoretical readability of up to 30 years. The primary caveat is how you handle and store the media. http://support.tdkperformance.com/app/answers/detail/a_id/1685/~/life-expectancy-of-optical-media
  5. "Amazon Glacier is designed to provide average annual durability of 99.999999999% for an archive. The service redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives. Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing." Maciej Ceglowski thinks that's kinda bullshit compared to the failure events you don't plan for, of course.
  6. "Practically the following results have been reported: 16Mbps in throughput for writing and about 8.8Mbps in reading" -- from https://tahoe-lafs.org/trac/tahoe-lafs/wiki/FAQ, making it non-competitive with the 1-2 gigabit speeds needed when archiving twitch.tv.