Talk:INTERNETARCHIVE.BAK
A note on the end-user drives
I feel it is really critical that the drives or directories sitting in the end-user's location be absolutely readable as a file directory containing the files, even if that directory is inside a .tar or .zip or .gz file. Making it into an encrypted item should not happen, unless we make a VERY SPECIFIC, and redundant, channel for such a thing. --Jscott 00:01, 2 March 2015 (EST)
- A possibility is that it's encrypted but easy to decrypt, so that it's harder to fake hashes for it, but it can still be unpacked into useful items even without the main support network there.
Potential solutions to the storage problem
Tape
- Is there a good reason why all these admittedly cool, but rather convoluted and difficult-to-restore-from, distributed solutions are being considered before the obvious one -- tape backup? A few T10000D tape drives operating in parallel would back up the Archive vastly faster and vastly more reliably. Enough tapes to store 21 PB would take up less than a cubic meter (see the back-of-envelope sketch below).
- This project is separate from any "official" project that Internet Archive would like to do with regard to backups. It is an experiment meant to think out ideas and maybe arrive at a nice ad-hoc approach to the idea of another location. For the record, though, you just quoted a solution (proprietary, as a bonus, although I assume you would think LTO was fine) that has an immediate cost of hundreds of thousands of dollars. --Jscott 10:41, 6 March 2015 (EST)
- Here's a report about the effectiveness of LTO over T10000D and other more proprietary solutions (among other things, when Oracle decides to stop making drives, the tapes are much less useful). [1]
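A back-of-envelope check of the volume claim above, as a minimal Python sketch. The 8.5 TB figure is the published native capacity of a T10000D cartridge; the cartridge dimensions are an assumption (roughly a standard half-inch data cartridge), so treat the result as an order-of-magnitude estimate only.

```python
# Rough check of the "less than a cubic meter" claim above.
# Assumptions: 21 PB to store, 8.5 TB native per T10000D cartridge,
# and a cartridge of roughly 12.5 x 10.9 x 2.6 cm (assumed, not measured).
ARCHIVE_BYTES = 21 * 10**15
TAPE_CAPACITY_BYTES = 8.5 * 10**12
CARTRIDGE_VOLUME_M3 = 0.125 * 0.109 * 0.026

tapes_needed = ARCHIVE_BYTES / TAPE_CAPACITY_BYTES
shelf_volume = tapes_needed * CARTRIDGE_VOLUME_M3
print(f"~{tapes_needed:.0f} cartridges, ~{shelf_volume:.2f} m^3 of raw cartridges")
# -> about 2,500 cartridges and a bit under one cubic meter, before racks/shelving
```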
Tahoe-LAFS
- Tahoe-LAFS - decentralized (mostly), client-side encrypted file storage grid
- Requires central introducer and possibly gateway nodes
- Any storage node could perform a Sybil attack until a feature for client-side storage node choice is added to Tahoe.
git-annex
- git-annex - allows tracking copies of files in git without them being stored in a repository
- Also provides a way to know what sources exist for a given item. git-annex is not (AFAIK) locked to any specific storage medium. -- yipdw
Right now, git-annex seems to be in the lead. Besides being flexible about the sources of the material in question, the developer is a member of Archive Team AND has been addressing all the big-picture problems for over a year.
A fully worked proposed design for using git-annex for this: https://git-annex.branchable.com/design/iabackup/ -- joeyh
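To make the git-annex approach above a bit more concrete, here is a small sketch (not taken from the design document linked above) that shells out to `git annex whereis --json` to count how many known copies exist of each annexed file. The JSON field names are assumptions based on current git-annex behavior and should be re-checked against the installed version.

```python
# Count known copies of each annexed file via `git annex whereis --json`.
import json
import subprocess

def copy_counts(repo_path="."):
    """Map each annexed file to the number of repositories known to hold a copy."""
    proc = subprocess.run(
        ["git", "annex", "whereis", "--json"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    counts = {}
    for line in proc.stdout.splitlines():   # one JSON object per annexed file
        record = json.loads(line)
        # "file" and "whereis" are the field names git-annex currently emits (assumption)
        counts[record["file"]] = len(record.get("whereis", []))
    return counts

if __name__ == "__main__":
    for path, copies in sorted(copy_counts().items()):
        print(f"{copies}x  {path}")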
Other
- STORJ - blockchain-based private cloud storage.
- IPFS - "You can loosely think of ipfs as git + bittorrent + dht + web."
- Permacoin - Repurposing Bitcoin Work for Data Preservation
- Compact Proofs of Retrievability
- Camlistore - your personal storage system for life
See Also
Don't forget about the options enumerated as part of the Valhalla project (particularly the software options); this is much the same thing.
Other anticipated problems
- Users tampering with data - how do we know data a user stored has not been modified since it was pulled from IA?
- Proposed solution: have multiple people make their own collection of checksums of IA files. --Mhazinsk 00:10, 2 March 2015 (EST)
- All IA items already include checksums in the _files.xml. So there could be an effort to back up these xml files in more locations than the data itself (should be feasible since they are individually quite small).
- To prevent false claims of "having the data" when it has actually been deleted, perhaps some kind of proof-of-data scheme where an 'authoritative node' (i.e. trusted) randomly 'challenges' clients to provide a checksum of a given chunk of data, after XORing it with a certain key? That way a node cannot pre-calculate checksums, and if it tries to do so anyway, it will need to store more checksum data than if it just stored the original files. This definitely needs attention from a cryptographer as to the exact implementation, but it might work (a rough sketch of the idea appears below). It does mean that there always needs to be a known-valid copy of the original data for the proof-of-data scheme to work, so it might leave an exposure window when the original data is lost, where a particularly quick adversary could remove the data straight away. Joepie91 backup 20:25, 12 March 2015 (EDT)
- Looks like there are already some proposed solutions for this: https://cseweb.ucsd.edu/~hovav/dist/verstore.pdf and http://cs.umd.edu/~amiller/nonoutsourceable.pdf (ref https://news.ycombinator.com/item?id=9159349) Joepie91 backup 20:58, 12 March 2015 (EDT)
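Here is a rough, unvetted sketch of the challenge idea above: the trusted node keeps a known-good copy of a chunk, picks a fresh random key per challenge, and both sides hash the chunk after mixing it with a keystream derived from that key. The SHA-256-in-counter-mode keystream and the 64 KiB chunk size are illustrative choices, not part of any agreed design.

```python
# Toy proof-of-data challenge: hash the chunk XORed with a key-derived keystream.
import hashlib
import os

def keystream(key: bytes, length: int) -> bytes:
    """Illustrative keystream: SHA-256 of key || counter, concatenated."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def challenge_response(chunk: bytes, key: bytes) -> str:
    """What the storage node returns: a hash of the chunk XORed with the keystream."""
    mixed = bytes(a ^ b for a, b in zip(chunk, keystream(key, len(chunk))))
    return hashlib.sha256(mixed).hexdigest()

def verify(known_good_chunk: bytes, key: bytes, answer: str) -> bool:
    """The trusted node holds a known-good copy, so it can recompute the expected answer."""
    return challenge_response(known_good_chunk, key) == answer

if __name__ == "__main__":
    chunk = os.urandom(64 * 1024)   # stand-in for a 64 KiB chunk of an IA file
    key = os.urandom(32)            # fresh per challenge, so answers cannot be precomputed
    assert verify(chunk, key, challenge_response(chunk, key))
    print("challenge answered correctly")
```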
- "Dark" items (e.g. the "Internet Records" collection)
- There are classifications of items within the Archive that should be considered for later waves, and not this initial effort. That includes dark items, television, and others.
- It seems like this would include a lot of what we would want to back up the most though, e.g. a substantial percentage of the books scanned are post-1923 and not public
- Data which may be illegal in certain countries/jurisdictions and expose volunteers to legal risk (terrorist propaganda, pornography, etc.)
- Interesting! Several solutions come to mind. --Jscott 02:35, 2 March 2015 (EST)
- User bandwidth (particularly upstream)
- Latency in swapping disks - assume we may be using cold storage
- Tiered storage? e.g. one for cloud, one for online trusted users' storage, and one for cold storage
- User laziness
- User motivation
- User trust
- Build a user community --Chfoo 23:18, 2 March 2015 (EST)
- Bitflip, bitrot, data corruption
- De-duplication
- There are many files which are duplicated in the Archive. IA does not do de-duplication themselves. Census data shows hundreds of terabytes of duplicated files. Here is a list of the top 1,000 most wasteful duplicate files.
- We could take this into account when setting redundancy levels. If we target 4x for each file, then maybe we only need 2x for files which are included multiple times.
- We should certainly use something more resilient than MD5 for finding duplicates. I would be surprised if there isn't already an item full of MD5 collisions. --Sep332 00:06, 8 March 2015 (EST)
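As a concrete illustration of the "stronger than MD5" point above, here is a small sketch that groups local files by size and then by SHA-256. Against the Archive itself one would instead read the sha1/size fields already present in each item's _files.xml, but the grouping logic is the same.

```python
# Find duplicate files under a directory using size + SHA-256 instead of MD5.
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, bufsize: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(bufsize):
            h.update(block)
    return h.hexdigest()

def find_duplicates(root: str):
    """Group files under `root` by (size, sha256); groups with >1 entry are duplicates."""
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)
    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue                     # a unique size cannot be a duplicate
        for path in paths:
            groups[(size, sha256_of(path))].append(path)
    return {k: v for k, v in groups.items() if len(v) > 1}

if __name__ == "__main__":
    import sys
    for (size, digest), paths in find_duplicates(sys.argv[1]).items():
        print(f"{size} bytes, sha256 {digest[:12]}...: {len(paths)} copies")
```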
Project Lab and Corner
- Projects are much easier with the Internet Archive tool, available here.
- There is a _files.xml in each item indicating what files are original and which are derivations.
- Please step forward and write a script that, given a collection, finds all the items in that collection and adds up all the sizes of the original files.
- https://gist.github.com/EricIO/f77f094032110a7b51e7 running `python ia-collection-size.py <collection-name>` will give you the size of the original files and the total.
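For reference, a rough inline sketch of the same idea as the gist above, using the `internetarchive` Python library (pip install internetarchive). The 'source' and 'size' fields come from each item's _files.xml; exact library behavior should be checked against the version you have installed.

```python
# Sum total and original-file sizes for every item in a collection.
import sys
from internetarchive import get_item, search_items

def collection_sizes(collection: str):
    total = original = 0
    for result in search_items(f"collection:{collection}"):
        item = get_item(result["identifier"])
        for f in item.files:                  # parsed from the item's _files.xml
            size = int(f.get("size", 0) or 0)
            total += size
            if f.get("source") == "original":
                original += size
    return original, total

if __name__ == "__main__":
    orig, tot = collection_sizes(sys.argv[1])
    print(f"original: {orig} bytes, total: {tot} bytes ({orig / max(tot, 1):.0%} original)")
```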
Some results so far:
Collection | Link to Collection | Number of Items | Total Size | Original Files Size | % of Total |
---|---|---|---|---|---|
Ephemeral Films | [2] | 2932 | 10971882551213 (10.9 TB) | 9453160185702 (9.4 TB) | 86% |
Computer Magazines | [3] | 13066 | 3392870124693 (3.3 TB) | 1897118607284 (1.8 TB) | 55% |
Software Library | [4] | 27861 | 63140205942 (63.1 GB) | 61142015946 (61.1 GB) | 96% |
Prelinger Archive | [5] | 6477 | 14603406806901 (14.6 TB) | 13792309835153 (13.7 TB) | 94% |
Grateful Dead | [6] | 10006 | | | |
Census
The IA have kindly provided a census of the contents of the Archive.
https://archive.org/details/ia-bak-census_20150304
- 14.23 petabytes
- 14,926,080 items
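For anyone who wants the census locally, a minimal sketch using the `internetarchive` Python library mentioned in the Project Lab section below; the item identifier is taken from the URL above.

```python
# Fetch the census item with the internetarchive library (pip install internetarchive).
from internetarchive import download

# Downloads the item's files into ./ia-bak-census_20150304/
download("ia-bak-census_20150304", verbose=True)
```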
Case Studies
If you implement it, will users use it?
BOINC
- Why do people participate in BOINC projects?
- Why do projects use BOINC?
- How does BOINC keep track of work units?
- How does BOINC deal with bad actors?
- Why do BOINC projects share project users and points among other projects?
- What makes people download the client software and install it?
Stack Overflow
- Q & A sites existed before Stack Overflow. What makes Stack Overflow so successful?
- How does Stack Overflow eliminate bad questions and answers?
- What makes Stack Exchange grow so large?
- How does it deal with spam?
rule34.paheal.net
The idea of "backing up the Internet Archive" is what this page (INTERNETARCHIVE.BAK) is about. Of course, backing up the whole thing would be extremely difficult, but parts of it can be backed up.
- For example, in 2018 the Wayback Machine had archives of rule34.paheal.net URLs, but as of now it does not serve them.* Deleted posts and old web pages of rule34.paheal.net (a website established in 2007) that were in the Wayback Machine now seem to be gone forever, unless they were also archived at archive.is (a website established in 2012).
So my point is that archiving Wayback Machine archives at archive.is could be helpful.
*For example: http://archive.is/Mno0M contains a broken image and the webpage is sorta NSFW; this link is an archive of http://web.archive.org/web/20180312000910/http://rule34.paheal.net/post/view/2520590 (http://rule34.paheal.net/post/view/2520590 = deleted) and was found in http://archive.is/offset=50/http://rule34.paheal.net/post/view/*. Try going to http://web.archive.org/web/20180312000910/http://rule34.paheal.net/ or http://web.archive.org/web/20180312000910/http://rule34.paheal.net/post/view/2520590 now and it will say "Sorry. / This URL has been excluded from the Wayback Machine." Also I personally remember seeing Wayback Machine archives of rule34.paheal.net in 2018 or before. --Usernam (talk) 06:08, 15 April 2019 (UTC)
- It's probably because of the site's robots.txt file, which is respected by the Internet Archive. That is, if it says Disallow for the site or specific paths, IA won't crawl it, and for already-crawled parts, IA won't show them in the Wayback Machine. This doesn't mean they deleted already-archived stuff; one day, when the robots.txt changes or disappears, the archives will show up again in the Wayback Machine.
- It's also possible that IA still archives stuff, just doesn't show them. See this article: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ bzc6p (talk) 08:25, 21 April 2019 (UTC)
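To see whether the robots.txt explanation above currently applies, here is a minimal sketch using Python's standard-library robots.txt parser. The assumption that "ia_archiver" is the user-agent token the Wayback Machine honors for exclusions is mine, not something confirmed on this page.

```python
# Check whether the site's current robots.txt disallows the IA crawler for given URLs.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://rule34.paheal.net/robots.txt")
rp.read()

for url in ("http://rule34.paheal.net/",
            "http://rule34.paheal.net/post/view/2520590"):
    verdict = "allowed" if rp.can_fetch("ia_archiver", url) else "disallowed"
    print(url, "->", verdict)
```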