Difference between revisions of "Internet Archive"

From Archiveteam
(→‎Uploading to archive.org: torrent broken since early 2025)
 
(95 intermediate revisions by 32 users not shown)
Line 1: Line 1:
{{Infobox project
{{Infobox project
| title = Internet Archive
| logo = IAsquarelogo.png
| image = Internet Archive- Digital Library of Free Books, Movies, Music & Wayback Machine 1292930995846.png
| image = Internet Archive- Digital Library of Free Books, Movies, Music & Wayback Machine 1292930995846.png
| description = Internet Archive mainpage in 2010-12-21
| description = Internet Archive main page in August 2016
| URL = {{url|1=http://www.archive.org}}
| URL = {{url|1=https://archive.org}}
| source = [https://github.com/ArchiveTeam/IA.BAK IA.BAK]
| project_status = {{endangered}}<ref> https://www.vice.com/en_us/article/5dzg8n/archiving-the-internet-archive-sued-by-publishers</ref>
| tracker = [http://teamarchive1.fnf.archive.org/ia.bak/ ia.bak]
| archiving_status = {{onhiatus}}
| project_status = {{online}}
| irc = internetarchive
| archiving_status = {{Inprogress}}
 
| irc = internetarchive.bak
}}
}}
The '''Internet Archive''' is a non-profit digital library with the stated mission/motto: "universal access to all knowledge". The Internet Archive stores over 400 billion webpages from different dates and times for historical purposes that are available through the Wayback Machine, arguably an archivist's wet dream. The Archive.org website also archives books, music, videos, and software.
The '''Internet Archive''' is a non-profit digital library with the stated mission/motto: "universal access to all knowledge". The Internet Archive stores over 900 billion webpages from different dates and times for historical purposes that are available through the Wayback Machine, arguably an archivist's wet dream. The Archive.org website also archives books, music, videos, and software.


== Mirrors ==
== Mirrors ==
There are currently two mirrors of the Internet Archive collection - the official mirror available at archive.org, and a second mirror at Bibliotheca Alexandrina. The former seems to be up and stable while the latter still has its [http://www.bibalex.org/isis/frontend/archive/archive_web.aspx homepage] working but not the rest of the site, which went down around April-May 2023, as [[Bing]] still has cached versions of a few pages from March 2023, viewable by typing in "site:web.archive.bibalex.org" on Bing and pressing the cache button.


There are currently two mirrors of the Internet Archive collection - the official mirror available at archive.org, and a second mirror at Bibliotheca Alexandrina. Both seem to be up and stable.
Some manually-selected collections are also mirrored manually as part of the project [[INTERNETARCHIVE.BAK]]. See that page and the section [[#Backing up the Internet Archive]].


== Raw Numbers ==
== Raw Numbers ==
Line 30: Line 30:
* Unique data: 18.5 PetaBytes
* Unique data: 18.5 PetaBytes
* Total used storage: 50 PetaBytes
* Total used storage: 50 PetaBytes
December 2021:
* 4 data centers, 745 nodes, 28,000 spinning disks
* Wayback Machine: 57 PetaBytes
* Books/Music/Video Collections: 42 PetaBytes
* Unique data: 99 PetaBytes
* Total used storage: 212 PetaBytes
=== Items added per year ===
Search made 21:56, 17 January 2016 (EST) (this is just from the (mutable) "addeddate" metadata, so it might change, although it shouldn't)
{| class="wikitable"
! Year !! Items added
|-
| 2001 || 63
|-
| 2002 || 4,212
|-
| 2003 || 18,259
|-
| 2004 || 61,629
|-
| 2005 || 61,004
|-
| 2006 || 185,173
|-
| 2007 || 334,015
|-
| 2008 || 429,681
|-
| 2009 || 807,371
|-
| 2010 || 813,764
|-
| 2011 || 1,113,083
|-
| 2012 || 1,651,036
|-
| 2013 || 3,164,482
|-
| 2014 || 2,424,610
|-
| 2015 || 3,113,601
|}
<!-- Code to regenerate it:
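import internetarchive
# Count the items whose (mutable) addeddate falls in each year from 2001 to 2015.
data = [(y, internetarchive.api.search_items('addeddate:[{} TO {}]'.format(y, y+1)).num_found) for y in range(2001, 2016)]
print('\n'.join(' |-\n | {} || {}'.format(*x) for x in data))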
-->


== Uploading to archive.org ==
== Uploading to archive.org ==


Upload any content you manage to preserve! Registering takes a minute.
[https://archive.org/upload/ Upload] any content you manage to preserve! Registering takes a minute.


Tools:
=== Tools ===
* For quick one-shot webpage archiving, use the [https://archive.org/web/ Wayback Machine]'s "Save Page Now" tool.
 
** There's also an awesome JavaScript Bookmarklet and Chrome extension made by @bitsgalore that provide a fast way to submit pages on the Internet Archive. You can get them here: http://www.bitsgalore.org/2014/08/02/How-to-save-a-web-page-to-the-Internet-Archive/
There are three main methods to upload items to the Internet Archive programmatically:
* [http://www.archive.org/help/abouts3.txt S3 interface] (for direct usage with curl, or indirect with the tool of your choice.)
* [https://pypi.python.org/pypi/internetarchive internetarchive Python library] is the main tool now, see the extensive https://archive.org/services/docs/api/
** [https://pypi.python.org/pypi/internetarchive internetarchive Python tool] is one such tool.
* [https://github.com/kngenie/ias3upload Handy script for mass upload (ias3upload.pl)] with automatic error checking and retry
* [https://github.com/kngenie/ias3upload Handy script for mass upload] with automatic error checking and retry.
* [https://archive.org/help/abouts3.txt S3 interface] (for direct usage with curl, or indirect with the tool of your choice)
* Torrent upload, useful if you need resume (for huge files or because your bandwidth is insufficient for upload in one go):
** Just create the item, make a torrent with your files in it, name it like the item, and upload it to the item.
** archive.org will connect to you and other peers via a Transmission daemon and keep downloading all the contents till done;
** For a command line tool you can use e.g. mktorrent or buildtorrent, example: <code>mktorrent -a udp://tracker.publicbt.com:80/announce -a udp://tracker.openbittorrent.com:80 -a udp://tracker.ccc.de:80 -a udp://tracker.istole.it:80 -a http://tracker.publicbt.com:80/announce -a http://tracker.openbittorrent.com/announce "DIRECTORYTOUPLOAD"</code> ;
** You can then seed the torrent with one of the many graphical clients (e.g. Transmission) or on the command line (Transmission and rtorrent are the most popular; btdownloadcurses reportedly doesn't work with udp trackers.)
** archive.org will stop the download if the torrent stalls for some time and add a file to your item called "resume.tar.gz", which contains whatever data was downloaded. To resume, delete the '''empty''' file called <code>IDENTIFIER_torrent.txt</code>; then, resume the download by re-deriving the item (you can do that from the Item Manager.) Make sure that there are online peers with the data before re-deriving and don't delete the torrent file from the item.


Don't use FTP upload, try to keep your items below 400 GiB size, add plenty of metadata.
Don't use FTP upload, try to keep your items below 400 GiB size, add plenty of metadata.
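As a rough illustration of the S3 interface listed above, the following Python sketch only assembles the request (URL and headers) for a single-file upload, without sending it. The <code>s3.us.archive.org</code> endpoint, the <code>authorization: LOW</code> scheme and the <code>x-amz-auto-make-bucket</code> and <code>x-archive-meta-*</code> headers follow the linked abouts3.txt documentation; the function name, identifier, keys and metadata values are placeholders invented for this example.

```python
import urllib.request
from urllib.parse import quote


def build_ias3_put(identifier, filename, access_key, secret_key, metadata):
    """Assemble (but do not send) a PUT request for the IA S3-like API.

    Every value passed in below is a placeholder for illustration only.
    """
    url = "https://s3.us.archive.org/{}/{}".format(quote(identifier), quote(filename))
    headers = {
        # IA S3 keys are passed with the "LOW" authorization scheme.
        "authorization": "LOW {}:{}".format(access_key, secret_key),
        # Ask for the item to be created if it does not exist yet.
        "x-amz-auto-make-bucket": "1",
    }
    # Each metadata field becomes an x-archive-meta-<name> header.
    for name, value in metadata.items():
        headers["x-archive-meta-" + name] = value
    return urllib.request.Request(url, method="PUT", headers=headers)


request = build_ias3_put(
    "my-example-item",        # hypothetical identifier
    "example.zip",
    "ACCESSKEY", "SECRET",    # placeholders, not real credentials
    {"title": "Example upload", "mediatype": "data"},
)
print(request.get_method(), request.full_url)
```

To actually upload, the file contents would be attached as the request body and the request sent, for instance with <code>urllib.request.urlopen</code>; in practice the internetarchive Python library handles this, including retries.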
=== [[Internet Archive/Save Page Now|Wayback Machine ''Save Page Now'']] ===
* For quick one-shot webpage archiving, use the Wayback Machine's [https://web.archive.org/save/ "Save Page Now" tool].
** See [https://blog.archive.org/2019/10/23/the-wayback-machines-save-page-now-is-new-and-improved/ October 2019 update] for details including access requests.
** Unless you are logged in to your Internet Archive account, you are limited to 150 captures per day.
** To submit a list of URLs, use https://archive.org/services/wayback-gsheets/ (avoid sending many thousands of URLs; there's [[Archivebot]] for that).
** You can also email lists of URLs, in the message body, to savepagenow@archive.org; this is useful for submitting automatic email digests (checked and working as of 2025-08).
Many scripts have been written to use the live proxy:
* JavaScript Bookmarklet and Chrome extension made by @bitsgalore that provide a fast way to submit pages on the Internet Archive. You can get them here: https://www.bitsgalore.org/2014/08/02/How-to-save-a-web-page-to-the-Internet-Archive
* [[UserScript]]: [[User:ATrescue/383915.user.js|''“AutoSave to Internet Archive - Wayback Machine”'']] by user ''“Flare0n”''. Mirrors: {{URL|1=https://userscripts-mirror.org/scripts/show/383915|2=Mirror 1}} {{URL|1=https://greasyfork.org/en/scripts/368062-autosave-to-internet-archive-wayback-machine/|2=Mirror 2}}. (No longer developed since 2014, but still functional.) <!-- (Wayback Machine is not [[Instagram#API|Instagram]], who crippled their API in June 2016.) -->
** [[User:ATrescue/AutoWB.js|Enhanced Edition (AutoWB.js)]].
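Scripts of this kind typically work by requesting a URL made of <code>https://web.archive.org/save/</code> followed by the page to capture. The Python sketch below shows that pattern under this assumption; the function names and the User-Agent string are made up for the example, and the function that performs the network request is defined but not called.

```python
import urllib.request

SAVE_PREFIX = "https://web.archive.org/save/"


def save_page_now_url(target_url):
    """Return the Save Page Now URL that requests a capture of target_url."""
    return SAVE_PREFIX + target_url


def request_capture(target_url, timeout=120):
    """Ask Save Page Now to capture target_url and return the HTTP status.

    This performs a real network request, so it is only defined here.
    """
    req = urllib.request.Request(
        save_page_now_url(target_url),
        headers={"User-Agent": "spn-example/0.1 (hypothetical script)"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.status


print(save_page_now_url("https://wiki.archiveteam.org/"))
```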
=== Torrent upload ===
''' This feature is broken [https://irclogs.archivete.am/internetarchive/2025-01-07#l3f862bc1 since early 2025] '''. Retained for historical reference.
Torrent upload, useful if you need resume (for huge files or because your bandwidth is insufficient for upload in one go):
* Just create the item, make a torrent with your files in it, name it like the item, and upload it to the item.
* archive.org will connect to you and other peers via a Transmission daemon and keep downloading all the contents till done;
* For a command line tool you can use e.g. mktorrent or buildtorrent, example: <code>mktorrent -a udp://tracker.publicbt.com:80/announce -a udp://tracker.openbittorrent.com:80 -a udp://tracker.ccc.de:80 -a udp://tracker.istole.it:80 -a http://tracker.publicbt.com:80/announce -a http://tracker.openbittorrent.com/announce "DIRECTORYTOUPLOAD"</code> ;
* You can then seed the torrent with one of the many graphical clients (e.g. Transmission) or on the command line (Transmission and rtorrent are the most popular; btdownloadcurses reportedly doesn't work with udp trackers.)
* archive.org will stop the download if the torrent stalls for some time and add a file to your item called "resume.tar.gz", which contains whatever data was downloaded. To resume, delete the '''empty''' file called <code>IDENTIFIER_torrent.txt</code>; then, resume the download by re-deriving the item (you can do that from the Item Manager.) Make sure that there are online peers with the data before re-deriving and don't delete the torrent file from the item.
=== Formats ===


Formats: anything, but:
Formats: anything, but:
Line 56: Line 122:


This [https://github.com/vmbrasseur/IAS3API/blob/master/specialfiles.md unofficial documentation page] explains various of the special files found in every item.
This [https://github.com/vmbrasseur/IAS3API/blob/master/specialfiles.md unofficial documentation page] explains various of the special files found in every item.
=== Upload speed ===
Quite often, it's hard to use your full bandwidth to/from the Internet Archive, which can be frustrating. The bottleneck may be temporary (check the current [https://monitor.archive.org/weathermap/weathermap.html network speed] and [https://analytics0.archive.org/stats/s3.php s3 errors]) but also persistent, especially if your network is far (e.g. transatlantic connections).
If your connection is slow or unreliable and you're trying to upload a lot of data, it's recommended to use [[User:JustAnotherArchivist|JAA]]'s [https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/ia-upload-stream <code>ia-upload-stream</code>]. Make sure you have [https://www.python.org/downloads/ Python 3.x] installed. Installing the [https://pypi.python.org/pypi/internetarchive internetarchive Python library], and checking that it works, is an easy way to potentially fulfil the script's requirements. The commands, somewhat similar to those of the internetarchive Python library, are roughly as follows:
  python -m ia-upload-stream --tries <amount> --concurrency <num> --input-file "<filename_as_indicated>" <itemname> "<desired_filename_on_IA>"
'''Where''':
* '''<code><amount></code>''': is the number of attempts to make before giving up,
* '''<code><num></code>''': is the level of concurrency to use; start with a number greater than 1,
* '''<code><filename_as_indicated></code>''': is the name of the file on your machine that you wish to upload,
* '''<code><itemname></code>''': is the name of the item on the Internet Archive. It is visible in the URI or address bar of the page you are trying to upload file(s) to, without <code>https://archive.org/details/</code> (see the example below), and,
* '''<code><desired_filename_on_IA></code>''': is the name you want the file to have on the Internet Archive.
'''Example''':
  python -m ia-upload-stream --tries 999 --concurrency 3 --input-file "example.zip" myitem "example.zip"
'''<code><itemname></code>''' must be the name of an item that you have previously created (e.g. via the web interface) and that you have the rights to upload to. This is also known as the identifier.
See e.g. <code>python -m ia-upload-stream --help</code> for more arguments that you can use, like <code>--no-derive</code>.
Some users with Gigabit upstream links or more, on common GNU/Linux operating systems (such as [[wikipedia:Alpine Linux|Alpine]]), have had some success in increasing their upload speed by using more memory on [[wikipedia:TCP_congestion_control#TCP_BBR|TCP congestion control]] and telling the kernel to live with higher latency and lower responsiveness, as in this example:
<pre>
# sysctl net.core.rmem_default=8388608 net.core.rmem_max=8388608 net.ipv4.tcp_rmem="32768 131072 8388608" net.core.wmem_default=8388608 net.core.wmem_max=8388608 net.ipv4.tcp_wmem="32768 131072 8388608" net.core.default_qdisc=fq net.ipv4.tcp_congestion_control=bbr
# sysctl kernel.sched_min_granularity_ns=1000000000 kernel.sched_latency_ns=1000000000 kernel.sched_migration_cost_ns=2147483647 kernel.sched_rr_timeslice_ms=100 kernel.sched_wakeup_granularity_ns=1000000000
</pre>


== Downloading from archive.org ==
== Downloading from archive.org ==


* [https://archive.org/help/wayback_api.php Wayback Machine APIs]
* [https://archive.org/help/wayback_api.php Wayback Machine APIs]
** [https://archive.org/wayback/available?url=archiveteam.org Availability] – data for one capture for a given URL
** [https://web.archive.org/web/timemap/link/archiveteam.org Memento] – data for all captures of a given URL
** [https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server CDX] – data for all captures of a given URL
* Other Wayback Machine APIs used in the website interface, not included in IA's list, include:
** [https://web.archive.org/web/timemap/?url=archiveteam.org&matchType=prefix&collapse=urlkey&output=json&fl=original%2Cmimetype%2Ctimestamp%2Cendtimestamp%2Cgroupcount%2Cuniqcount&filter=!statuscode%3A%5B45%5D..&limit=100 timemap] – data for a given URL prefix; note the <code>limit=100</code> parameter (which serves to prevent accidental downloads of gigabytes of JSON)
** [https://gext-api.archive.org/services/simhash/simhash?year={{CURRENTYEAR}}&url=archiveteam.org&compress=1 simhash] – hashes (<code>compress=0</code>), or the degree of change in content between consecutive captures (<code>compress=1</code>), for captures of a given URL for a given year
** [https://web.archive.org/__wb/calendarcaptures/2?url=archiveteam.org&date={{CURRENTYEAR}}&groupby=day calendarcaptures] – data for a given URL for a given year or day
** [https://web.archive.org/__wb/sparkline?url=archiveteam.org&collection=web&output=json sparkline] – summary of data for a given URL
** [https://web.archive.org/__wb/search/host?q=archiveteam.org host] – any hosts/domains detected for a given URL
** [https://web.archive.org/__wb/search/metadata?q=archiveteam.org metadata] – metadata for a given host/domain
** [https://web.archive.org/__wb/search/anchor?q=archiveteam anchor] – host/domain keyword search
* [https://pypi.python.org/pypi/internetarchive internetarchive Python tool]
* [https://pypi.python.org/pypi/internetarchive internetarchive Python tool]
** When searching, you can specify the sort order by providing a list of field names, switching to descending order by suffixing the string with " desc".
* Manually, from an individual item: click "HTTPS"; or replace <code>details</code> with <code>download</code> in the URL and reload. This will take you to a page with a link to download a ZIP containing the original files and metadata.
* Manually, from an individual item: click "HTTPS"; or replace <code>details</code> with <code>download</code> in the URL and reload. This will take you to a page with a link to download a ZIP containing the original files and metadata.
* In bulk: see http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
* In bulk: see https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
* There's also an [https://gist.github.com/garyrh/2a373cc5a097433471fa unofficial shell function] that checks how many URLs the Wayback Machine lists for a domain name.
* There's also an [https://gist.github.com/garyrh/2a373cc5a097433471fa unofficial shell function] that checks how many URLs the Wayback Machine lists for a domain name.
* Individual files within .zip and .tar archives can be listed, and downloaded, by appending a slash after the /download/ URL. This will bring up a listing of the content, from a URL with zipviewer.php in it. For example: https://archive.org/download/CreativeComputing_v03n06_NovDec1977/Creative_Computing_v03n06_Nov_Dec_1977_jp2.zip/
* To download a raw, unmodified page from the Wayback Machine, add "id_" to the end of the timestamp, e.g.
https://web.archive.org/web/20130806040521id_/http://faq.web.archive.org/page-without-wayback-code/
* There are also some other codes that can be added to the end of the timestamp, as described here: {{url|http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html#Archival_URL_Replay_Mode}}
<blockquote>
* id_ Identity - perform no alterations of the original resource, return it as it was archived.
* js_ Javascript - return document marked up as javascript.
* cs_ CSS - return document marked up as CSS.
* im_ Image - return document as an image.
* if_ Iframe - Used by default for frames and videos. Usually works for images too.
* oe_ - Hides the Wayback toolbar upon loading.
</blockquote>
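The Availability API and the timestamp modifiers listed above can be combined: look up the closest capture, then build a raw (<code>id_</code>) replay URL from its timestamp. The Python sketch below illustrates this without touching the network; the helper names are invented for this example, and the sample JSON is a hand-written illustration of the response shape, not real API output.

```python
import json
from urllib.parse import urlencode

# Modifiers from the list above; "" is the normal replay with the toolbar.
MODIFIERS = {"", "id_", "js_", "cs_", "im_", "if_", "oe_"}


def availability_url(url, timestamp=None):
    """Build an Availability API query, optionally near a given timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


def closest_timestamp(response_text):
    """Return the timestamp of the closest capture, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["timestamp"]
    return None


def replay_url(timestamp, original_url, modifier=""):
    """Build a Wayback Machine replay URL with an optional modifier."""
    if modifier not in MODIFIERS:
        raise ValueError("unknown modifier: " + modifier)
    return "https://web.archive.org/web/{}{}/{}".format(timestamp, modifier, original_url)


# Hand-written sample response, for illustration only.
SAMPLE = '{"archived_snapshots": {"closest": {"available": true, "timestamp": "20130806040521"}}}'

stamp = closest_timestamp(SAMPLE)
print(availability_url("archiveteam.org"))
print(replay_url(stamp, "http://faq.web.archive.org/page-without-wayback-code/", "id_"))
```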
=== robots.txt and the Wayback Machine ===
The Internet Archive used to respect a site's [[robots.txt]] file. If that file blocked the ia_archiver user-agent (either directly or with a wildcard rule), the Internet Archive would not crawl the disallowed paths, and it would block access through the Wayback Machine to all previously-crawled content matching the disallowed paths until the robots.txt entry was removed. If a site returned a server error when its robots.txt was requested, the IA also interpreted that as a 'Disallow: /' rule. From e-mail correspondence with info@archive.org on Jun 10, 2016 regarding a site returning a 503 HTTP status code for its robots.txt:
<blockquote>The internet Archive respects the privacy of site owners, and therefore, when an error message is returned when trying to retrieve a website’s robots.txt, we consider that as "Disallow: /". -Benjamin</blockquote>
As of April 2017, the Internet Archive is no longer fully respecting robots.txt<ref>{{URL|https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/}}</ref>, although this change may not be visible on all archived sites yet. Alexa's crawler still respects robots.txt<ref>{{URL|https://support.alexa.com/hc/en-us/articles/200450194-Alexa-s-Web-and-Site-Audit-Crawlers}}</ref>, and Archive-It respects robots.txt by default<ref>{{URL|https://support.archive-it.org/hc/en-us/articles/208001096-Avoid-robots-txt-exclusions}}</ref>. Users can still request that their [[List of websites excluded from the Wayback Machine|domain be excluded from the Wayback Machine]].
Note that if the content is available in the form of web archive (WARC) file through the IA's normal collections the WARC file may still be downloaded even if the content is not accessible through the Wayback Machine.
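To make the two kinds of blocking rules mentioned in this section concrete (a direct <code>ia_archiver</code> rule and a wildcard rule), here is a small Python sketch using the standard library's robots.txt parser. It only shows how such rules are conventionally interpreted; it is not the Internet Archive's own logic, and the rules, the <code>SomeOtherBot</code> agent and the example.com URLs are made up.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt text using Python's standard parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A rule that targets the ia_archiver user-agent directly.
DIRECT_RULE = """\
User-agent: ia_archiver
Disallow: /
"""

# A wildcard rule that applies to every crawler, ia_archiver included.
WILDCARD_RULE = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(DIRECT_RULE, "ia_archiver", "https://example.com/page"))
print(is_allowed(DIRECT_RULE, "SomeOtherBot", "https://example.com/page"))
print(is_allowed(WILDCARD_RULE, "ia_archiver", "https://example.com/private/x"))
print(is_allowed(WILDCARD_RULE, "ia_archiver", "https://example.com/public"))
```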
== Browsing ==
There are six top-level collections in the Archive, under which pretty much everything else falls. These are:
* {{IA id|web}} -- Web Crawls
* {{IA id|texts}} -- eBooks  and Texts
* {{IA id|movies}} -- Moving Image Archive
* {{IA id|audio}} -- Audio Archive
* {{IA id|software}} -- The Internet Archive Software Collection
* {{IA id|image}} -- Images
This is an incomplete list of significant sub-collections within the top-level ones:
* {{IA id|texts}} -- eBooks  and Texts
** {{IA id|opensource}} -- Community Texts
* {{IA id|movies}} -- Moving Image Archive
** {{IA id|opensource_movies}} -- Community Video
** {{IA id|television}} -- Television
*** {{IA id|adviews}} -- AdViews
*** {{IA id|tv}} -- TV News Search & Borrow
** {{IA id|tvarchive}} -- Television Archive (where the content in "TV News Search & Borrow" is located; not directly accessible)
* {{IA id|audio}} -- Audio Archive
** {{IA id|opensource_audio}} -- Community Audio
** {{IA id|etree}} -- Live Music Archive
** {{IA id|librivoxaudio}} -- The LibriVox Free Audiobook Collection
* {{IA id|software}} -- The Internet Archive Software Collection
** {{IA id|301works}} -- 301Works.org
** {{IA id|consolelivingroom}} -- Console Living Room
** {{IA id|coverdiscs}} -- CD and DVD Coverdisc Collection
** {{IA id|softwarelibrary}} -- Software Library
** {{IA id|open_source_software}} -- Community Software
* {{IA id|image}} -- Images
** {{IA id|flickrcommons}} -- Flickr Commons Archive
** {{IA id|maps_usgs}} -- USGS Maps
** {{IA id|nasa}} -- NASA Images
** {{IA id|coverartarchive}} -- Cover Art Archive
[[Internet Archive/Collections]] is a list of all the collections that contain other collections.


== Backing up the Internet Archive ==
== Backing up the Internet Archive ==
A discussion has begun about creating a distributed backup of the content of the Internet Archive. This is currently in the planning/testing phase. For the initial manifesto, see the [[INTERNETARCHIVE.BAK]] page, for the records of the brainstorming, see its [[Talk:INTERNETARCHIVE.BAK|talk page]], and to follow the discussion in real-time, join the '''#internetarchive.bak''' IRC channel on EFNet.
The contents of the Wayback Machine as of 2002 (and again in 2006) have been duplicated in Alexandria, Egypt, available via https://www.bibalex.org/isis/frontend/archive/archive_web.aspx .


'''UPDATE:''' The [[INTERNETARCHIVE.BAK/git-annex implementation|git-annex implementation]] is now testable and started with archiving some collections of the Archive. If you have a few hundred gigabytes of spare space (that you are willing to sacrifice for a long time), ArchiveTeam counts on your contribution.
In April 2015, ArchiveTeam founder [[user:jscott|Jason Scott]] came up with an idea of a distributed backup of the Internet Archive. In the following months, the necessary tools got developed and volunteers with spare disk space appeared, and now tens of terabytes of rare and precious digital content of the Archive have already been cloned in several copies around the world. The project is open to everyone who has got at least a few hundred gigabytes of disk space that they can sacrifice on the medium or long term. For details, see the [[INTERNETARCHIVE.BAK]] page.


<small>Let us clarify once again: ArchiveTeam is '''not''' the Internet Archive. This "backing up the Internet Archive" project, just like all the other website-rescuing ArchiveTeam projects, is '''not''' ordered, asked for, organized or supported by the Internet Archive, nor are the ArchiveTeam members employees of the Internet Archive (except a few). Besides accepting – and, in this case, providing – the content, the Internet Archive doesn't collaborate with the ArchiveTeam.</small>
<small>Let us clarify once again: ArchiveTeam is '''not''' the Internet Archive. This "backing up the Internet Archive" project, just like all the other website-rescuing ArchiveTeam projects, is '''not''' ordered, asked for, organized or supported by the Internet Archive, nor are the ArchiveTeam members employees of the Internet Archive (except a few). Besides accepting – and, in this case, providing – the content, the Internet Archive doesn't collaborate with the ArchiveTeam.</small>


=== Archiving status ===
Most of the directly downloadable items at IA are also available as torrents -- at any given time some fraction of these have external seeders, although as of 01:46, 17 February 2016 (EST) there is a problem with IA's trackers where they refuse to track many of the torrents.
 
== Copyright lawsuit==
 
The Internet Archive has faced a [https://en.wikipedia.org/wiki/Hachette_v._Internet_Archive lawsuit] from publishers over making digital copies of copyrighted works available. In September 2024 they lost at the Second Circuit Court of Appeals. If the level of damages awarded threatens their existence, then we may need to step in at very short notice to rescue their content.
 
== Technical notes ==
The history of tasks run on each item can be viewed (when logged in) by going to a URL of the form http:// archive.org/history/''IDENTIFIER'' (where ''IDENTIFIER'' is the id of the item, e.g. the part after "/details/" in a typical IA url).
 
Some of the task commands include:
; archive.php : Initial uploading, adding of reviews, and other purposes ([https://archive.org/history/Esmpc20040427 example])
; bup.php : ''B''acking ''UP'' items from their primary to their secondary storage location after they are modified (always appears last in any group of tasks) ([https://archive.org/history/Esmpc20040427 example])
; derive.php : Handles generating the derived data formats (e.g. converting audio files into mp3s, OCRing scanned texts, generating CDX indexes for WARCs) ([https://archive.org/history/Esmpc20040427 example])
; book_op.php : ? Includes virus scan, which usually takes a while. ([https://archive.org/history/Esmpc20040427 example])
; fixer.php : ? ([https://archive.org/history/Esmpc20040427 example])
; create.php : ? ([https://archive.org/history/gov.house.energycommerce.021207.eaq.hrg example])
; checkin.php : ? ([https://archive.org/history/gov.house.energycommerce.021207.eaq.hrg example])
; delete.php : Used early on (i.e. ~2007) to delete a few items -- not used (except on some test files) since, apparently. ([https://archive.org/history/gov.house.energycommerce.021207.eaq.hrg example])
; make_dark.php : Removes an item from public view; used for spam, malware, copyright issues, etc. ([https://archive.org/history/01SewingABeltTheArielKidsElasticBeltTruthBeltsVeganVegetarianBeltsFashion example])
; modify_xml.php : Modify the metadata of an item (?) ([https://archive.org/history/00004Muxed example])
; make_undark.php : Reverses the effect of make_dark.php ([https://archive.org/history/identifier example])


You can find an initial graph of the status of the testing shard [http://teamarchive1.fnf.archive.org/ia.bak/ here], and exact numbers [http://iabak.archiveteam.org/stats/SHARD1 here].
== Problems ==
* The Wayback Machine and the Internet Archive have suffered from slowdowns and long loading times while browsing, noticeable even on high-speed internet connections.
* Some websites cannot be archived by SPN due to the website itself, or to incorrect SPN behavior:
** Some URLs are [[Internet_Archive/Save_Page_Now#Blocks|blocked]] by IA from being archived via SPN
** IRCCloud pastes: the API and URLs return a blank HTTP 400 error to SPN
** updates.cdn-apple.com: returns a blank HTTP 400 error to SPN
* URLs are always normalized when they are indexed by the WBM. This means it cannot differentiate between the capitalization, protocol, or www subdomain of the URL. For example, <code>https://web.archive.org/web/20210125024207/http://www.wiki.archiveteam.org/INDEX.pHp</code> and <code>https://web.archive.org/web/20210125024207/https://wiki.archiveteam.org/index.php</code> link to the same capture, even though the latter is what actually was saved.
* In [https://blog.archive.org/2024/10/18/internet-archive-services-update-2024-10-17/ 2024] IA was targeted with DDOS and account data theft attacks.
* In 2024, some user accounts were deleted due to an admin error. They cannot be re-registered, and info@archive.org refuses to change the situation. Reviews, forum posts, lists and "My Web Archive" may have been deleted and cannot be restored yet. Uploaded items are still available, but are marked "Uploaded by Unknown". The only option is to register a different account name; if the same email address is used, uploads will be re-associated with the new account, but other data will still be lost.
* In 2025, the SPN email API appeared to have stopped working, but actually it was severely backlogged, with ~40 day processing times.
* In 2018, HEAD requests to the SPN API stopped working, but are working again in 2025.
* Since the hack, site visitors without JavaScript enabled get some URLs redirected to the same URLs with ?noscript=true, which return an HTTP 404 error. info@archive.org did not reply to a report about this. Example: [https://archive.org/?noscript=true]
* In 2017, some functions of the WBM added a requirement for JavaScript to be enabled in visitor's browsers.
* The HTML that the WBM adds to pages is not valid XML, which means it breaks XHTML pages. The fix is easy, info@archive.org forwarded it to engineers at least twice, but it was never fixed. Examples [https://web.archive.org/web/20220823113022/https://www.devever.net/~hl/ortega] [https://web.archive.org/web/20201124042239/http://bjh21.me.uk/all-escapes/all-escapes.xhtml].
* The SPN screenshots option doesn't also enable saving DOM snapshots.
* The SPN service with outlinks enabled does not have request rate limits, which [https://bsky.app/profile/xkeeper.net/post/3lrtokyll3k2s can overload some sites].
* WBM often returns "504 Gateway Time-out" when attempting to download very large files. Example: [https://web.archive.org/web/20250331190905/https://swcdn.apple.com/content/downloads/40/16/082-11498-A_J7T1GLHFVZ/chr3rxmbukm8zmyun90r1gz1wodsaeuzda/InstallAssistant.pkg]
* Some pages fail to save properly, with errors such as ''Job Failed'', ''Save Page Now could not capture this URL because it was unreachable.'', and ''SPN internal proxy error''. (These issues may be fixed by clicking the "error report" link.)
** The "error report" link itself may not respond when clicked.
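The URL normalization described in the list above can be approximated in a few lines of Python. This is only a sketch of the observable behaviour (scheme, a leading <code>www.</code> and letter case are ignored), not the Wayback Machine's actual canonicalization code, and the function name is invented for this example.

```python
from urllib.parse import urlsplit


def rough_wayback_key(url):
    """Approximate lookup key: ignores scheme, a leading "www." and case."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    key = host + parts.path.lower()
    if parts.query:
        key += "?" + parts.query.lower()
    return key


first = rough_wayback_key("http://www.wiki.archiveteam.org/INDEX.pHp")
second = rough_wayback_key("https://wiki.archiveteam.org/index.php")
print(first)
print(first == second)
```

Two URLs that produce the same key under this approximation should be expected to resolve to the same captures, which is why the originally saved form cannot be told apart from its variants.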


== See also ==
== See also ==
* [[Working with ARCHIVE.ORG]]
* [[Working with ARCHIVE.ORG]]
* [[Internet Archive/Advanced Search]] -- a copy of the documentation without the browser-breaking thousand-item dropdowns on the actual page
* [[Internet Archive/Advanced Search]] -- a copy of the documentation without the browser-breaking thousand-item dropdowns on the actual page
* [[Internet Archive Census]]
* https://internetarchive.archiveteam.org/ -- Hitchhiker's Guide to the Internet Archive, a non-official wiki about the Internet Archive
* [[Archive_Services|Alternative archiving services]]


== External links ==
== External links ==
* {{url|1=http://www.archive.org}}
* {{url|1=https://archive.org}}
* https://help.archive.org/
* {{url|1=http://archive.bibalex.org|2=Bibliotheca Alexandrina mirror}}
* {{url|1=http://archive.bibalex.org|2=Bibliotheca Alexandrina mirror}}
* {{url|1=http://www.archive.org/web/petabox.php|2=Petabox details}}
* {{url|1=https://archive.org/web/petabox.php|2=Petabox details}}
* {{url|1=https://pypi.python.org/pypi/internetarchive|2=A python interface to archive.org}}
* {{url|1=https://pypi.python.org/pypi/internetarchive|2=A python interface to archive.org}}
* {{url|https://archive.org/help/json.php|JSON API for archive.org services and metadata}}
* {{url|https://developers.archive.org/|Developer portal (beta)}}
* {{url|https://blog.archive.org/developers/|Old developer portal}}
* {{url|https://en.wikipedia.org/wiki/Help:Using_the_Wayback_Machine|English Wikipedia page on Help:Using the Wayback Machine}}
* {{url|https://en.wikipedia.org/wiki/Lists_of_Internet_Archive%27s_collections|English Wikipedia page: Lists of Internet Archive's collections}}
* {{url|1=http://news.oreilly.com/2008/06/gordon-mohr-takes-us-inside-th.html|2=Gordon Mohr Takes Us Inside the Internet Archives}} -- Interview from June 18, 2008, mentions Alexandria copy
* {{url|https://monitor.archive.org/weathermap/weathermap.html}}
* {{url|https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server}} - More useful documentation of the Wayback CDX endpoint
* {{url|https://archive.org/web/researcher/cdx_legend.php}} - Mostly unhelpful documentation of the cdx format.
* {{url|https://github.com/internetarchive/CDX-Writer}} - "Python script to create CDX index files of WARC data"
* {{IA item|ArchiveAcademy}} -- a number of internal-focused presentations by IA staff on implementation-details
* {{url|https://addons.mozilla.org/firefox/addon/archive-webextension/|2=archive-webextension}} - Firefox add-on for saving pages into the Internet Archive
* {{url|https://github.com/hartator/wayback-machine-downloader}}
'''Unofficial mobile apps'''
* https://play.google.com/store/apps/details?id=com.internet.archive
== References ==
<references />


{{Navigation box}}

Latest revision as of 17:22, 23 September 2025

The Internet Archive is a non-profit digital library with the stated mission/motto: "universal access to all knowledge". The Internet Archive stores over 900 billion webpages from different dates and times for historical purposes that are available through the Wayback Machine, arguably an archivist's wet dream. The Archive.org website also archives books, music, videos, and software.

Mirrors

There are currently two mirrors of the Internet Archive collection - the official mirror available at archive.org, and a second mirror at Bibliotheca Alexandrina. The former seems to be up and stable while the latter still has its homepage working but not the rest of the site, which went down around April-May 2023, as Bing still has cached versions of a few pages from March 2023, viewable by typing in "site:web.archive.bibalex.org" on Bing and pressing the cache button.

Some manually-selected collections are also mirrored manually as part of the project INTERNETARCHIVE.BAK. See that page and the section #Backing up the Internet Archive.

Raw Numbers

December 2010:

  • 4 data centers, 1,300 nodes, 11,000 spinning disks
  • Wayback Machine: 2.4 PetaBytes
  • Books/Music/Video Collections: 1.7 PetaBytes
  • Total used storage: 5.8 PetaBytes

August 2014:

  • 4 data centers, 550 nodes, 20,000 spinning disks
  • Wayback Machine: 9.6 PetaBytes
  • Books/Music/Video Collections: 9.8 PetaBytes
  • Unique data: 18.5 PetaBytes
  • Total used storage: 50 PetaBytes

December 2021:

  • 4 data centers, 745 nodes, 28,000 spinning disks
  • Wayback Machine: 57 PetaBytes
  • Books/Music/Video Collections: 42 PetaBytes
  • Unique data: 99 PetaBytes
  • Total used storage: 212 PetaBytes

Items added per year

Search made 21:56, 17 January 2016 (EST) (this is just from the (mutable) "addeddate" metadata, so it might change, although it shouldn't)

Year Items added
2001 63
2002 4,212
2003 18,259
2004 61,629
2005 61,004
2006 185,173
2007 334,015
2008 429,681
2009 807,371
2010 813,764
2011 1,113,083
2012 1,651,036
2013 3,164,482
2014 2,424,610
2015 3,113,601

Uploading to archive.org

Upload any content you manage to preserve! Registering takes a minute.

Tools

There are three main methods to upload items to the Internet Archive programmatically:

Don't use FTP upload; try to keep your items below 400 GiB in size; add plenty of metadata.
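
As a rough illustration of the advice above, the following sketch checks that a set of files stays under the 400 GiB guideline before handing them to the internetarchive Python library. The helper names (total_size, fits_in_one_item, upload_item) are invented for this example. The upload() call with identifier, files and metadata arguments is assumed to match the library's documented interface, and the library has to be installed and configured with your credentials separately.

```python
import os

# Size guideline from this page: keep items below 400 GiB.
MAX_ITEM_BYTES = 400 * 1024**3


def total_size(paths):
    """Return the combined size in bytes of the given local files."""
    return sum(os.path.getsize(path) for path in paths)


def fits_in_one_item(paths, limit=MAX_ITEM_BYTES):
    """True if the files together stay within the size guideline."""
    return total_size(paths) <= limit


def upload_item(identifier, paths, metadata):
    """Upload files to one item (hypothetical wrapper, not an official helper).

    The third-party import is done lazily so the size helpers above can be
    used even where the internetarchive library is not installed.
    """
    if not fits_in_one_item(paths):
        raise ValueError("files exceed the 400 GiB guideline; split the item")
    from internetarchive import upload  # third-party: pip install internetarchive
    return upload(identifier, files=list(paths), metadata=metadata)
```

A call might look like upload_item("myitem", ["example.zip"], {"title": "Example", "mediatype": "data"}), where the identifier and metadata values are placeholders.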

Wayback Machine Save Page Now

  • For quick one-shot webpage archiving, use the Wayback Machine's "Save Page Now" tool.
    • See October 2019 update for details including access requests.
    • Unless you are logged in to your Internet Archive account, you are limited to 150 captures per day.
    • To input a list of URLs, use https://archive.org/services/wayback-gsheets/ (avoid trying to send many thousands of URLs; there's Archivebot for that).
    • There's also an email address, savepagenow@archive.org, to which you can send lists of URLs in the message body; this is useful for submitting automatic email digests (checked and works as of 2025-08).
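
For scripted one-shot captures, a minimal sketch using only the Python standard library might look like the following. It assumes the public https://web.archive.org/save/ URL prefix, which is not spelled out on this page, and the function names and User-Agent string are made up for illustration. Anonymous use is subject to the daily capture limit mentioned above.

```python
import urllib.request

# Assumed Save Page Now prefix; the target URL is appended verbatim.
SPN_PREFIX = "https://web.archive.org/save/"


def spn_request_url(url: str) -> str:
    """Build the Save Page Now request URL for a target page."""
    return SPN_PREFIX + url


def save_page_now(url: str, timeout: float = 120.0) -> str:
    """Ask the Wayback Machine to capture `url`.

    Returns the final URL after any redirects, which is usually (but not
    guaranteed to be) the address of the new capture. Raises
    urllib.error.HTTPError / URLError if the request fails.
    """
    request = urllib.request.Request(
        spn_request_url(url),
        headers={"User-Agent": "spn-example/0.1"},  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()
```

Nothing is sent until save_page_now() is called, e.g. print(save_page_now("https://example.com/")).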

Many scripts have been written to use the live proxy:

Torrent upload

This feature has been broken since early 2025. The instructions below are retained for historical reference.

Torrent upload, useful if you need resume (for huge files or because your bandwidth is insufficient for upload in one go):

  • Just create the item, make a torrent with your files in it, name it like the item, and upload it to the item.
  • archive.org will connect to you and other peers via a Transmission daemon and keep downloading all the contents till done;
  • For a command line tool you can use e.g. mktorrent or buildtorrent, example: mktorrent -a udp://tracker.publicbt.com:80/announce -a udp://tracker.openbittorrent.com:80 -a udp://tracker.ccc.de:80 -a udp://tracker.istole.it:80 -a http://tracker.publicbt.com:80/announce -a http://tracker.openbittorrent.com/announce "DIRECTORYTOUPLOAD" ;
  • You can then seed the torrent with one of the many graphical clients (e.g. Transmission) or on the command line (Transmission and rtorrent are the most popular; btdownloadcurses reportedly doesn't work with udp trackers.)
  • archive.org will stop the download if the torrent stalls for some time and add a file to your item called "resume.tar.gz", which contains whatever data was downloaded. To resume, delete the empty file called IDENTIFIER_torrent.txt; then, resume the download by re-deriving the item (you can do that from the Item Manager.) Make sure that there are online peers with the data before re-deriving and don't delete the torrent file from the item.

Formats

Formats: anything, but:

  • Sites should be uploaded in WARC format;
  • Audio, video, books and other prints are supported from a number of formats;
  • For .tar and .zip files archive.org offers an online browser to search and download the specific files one needs, so you probably want to use either unless you have good reasons (e.g. if 7z or bzip2 reduce the size tenfold).

This unofficial documentation page explains several of the special files found in every item.

Upload speed

Quite often, it's hard to use your full bandwidth to or from the Internet Archive, which can be frustrating. The bottleneck may be temporary (check the current network speed and s3 errors) but can also be persistent, especially if your network is far away (e.g. transatlantic connections).

If your connection is slow or unreliable and you're trying to upload a lot of data, it's recommended to use JAA's ia-upload-stream. Make sure you have Python 3.x installed and that the internetarchive Python library is installed and functional (installing it is also an easy way to fulfil the requirements). The commands are somewhat similar to those of the internetarchive Python library and are roughly as follows:

 python -m ia-upload-stream --tries <amount> --concurrency <num> --input-file "<filename_as_indicated>" <itemname> "<desired_filename_on_IA>"

Where:

  • <amount>: the number of tries to attempt before giving up;
  • <num>: the concurrency level to run with; start with numbers above 1;
  • <filename_as_indicated>: the name of the file on your machine that you wish to upload;
  • <itemname>: the name of the item on Internet Archive; this is the part of the item page's URL that follows https://archive.org/details/ (see the example below);
  • <desired_filename_on_IA>: the name the file should have on Internet Archive.

Example:

 python -m ia-upload-stream --tries 999 --concurrency 3 --input-file "example.zip" myitem "example.zip"

<itemname> must be the name of an item that you have previously created (e.g. via the web interface) and that you have the rights to upload to. This is also known as the identifier. See e.g. python -m ia-upload-stream --help for more arguments that you can use, like --no-derive.

Some users with Gigabit upstream links or more, on common GNU/Linux operating systems (such as Alpine), have had some success in increasing their upload speed by giving TCP larger buffers, switching the congestion control algorithm, and telling the kernel to live with higher latency and lower responsiveness, as in this example:

# sysctl net.core.rmem_default=8388608 net.core.rmem_max=8388608 net.ipv4.tcp_rmem="32768 131072 8388608" net.core.wmem_default=8388608 net.core.wmem_max=8388608 net.ipv4.tcp_wmem="32768 131072 8388608" net.core.default_qdisc=fq net.ipv4.tcp_congestion_control=bbr
# sysctl kernel.sched_min_granularity_ns=1000000000 kernel.sched_latency_ns=1000000000 kernel.sched_migration_cost_ns=2147483647 kernel.sched_rr_timeslice_ms=100 kernel.sched_wakeup_granularity_ns=1000000000

Downloading from archive.org

  • Wayback Machine APIs
    • Availability – data for one capture for a given URL
    • Memento – data for all captures of a given URL
    • CDX – data for all captures of a given URL
  • Other Wayback Machine APIs used in the website interface, not included in IA's list, include:
    • timemap – data for a given URL prefix; note the limit=100 parameter (which serves to prevent accidental downloads of gigabytes of JSON)
    • simhash – hashes (compress=0), or the degree of change in content between consecutive captures (compress=1), for captures of a given URL for a given year
    • calendarcaptures – data for a given URL for a given year or day
    • sparkline – summary of data for a given URL
    • host – any hosts/domains detected for a given URL
    • metadata – metadata for a given host/domain
    • anchor – host/domain keyword search
  • internetarchive Python tool
    • When searching, you can specify the sort order by providing a list of field names, switching to descending order by suffixing the string with " desc".
  • Manually, from an individual item: click "HTTPS"; or replace details with download in the URL and reload. This will take you to a page with a link to download a ZIP containing the original files and metadata.
  • In bulk: see https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
  • There's also an unofficial shell function that checks how many URLs the Wayback Machine lists for a domain name.
  • Individual files within .zip and .tar archives can be listed, and downloaded, by appending a slash after the /download/ URL. This will bring up a listing of the content, from a URL with zipviewer.php in it. For example: https://archive.org/download/CreativeComputing_v03n06_NovDec1977/Creative_Computing_v03n06_Nov_Dec_1977_jp2.zip/
  • To download a raw, unmodified page from the Wayback Machine, add "id_" to the end of the timestamp, e.g.
https://web.archive.org/web/20130806040521id_/http://faq.web.archive.org/page-without-wayback-code/
  • id_ Identity - perform no alterations of the original resource, return it as it was archived.
  • js_ Javascript - return document marked up as javascript.
  • cs_ CSS - return document marked up as CSS.
  • im_ Image - return document as an image.
  • if_ Iframe - Used by default for frames and videos. Usually works for images too.
  • oe_ - Hides the Wayback toolbar upon loading.
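
The URL patterns in the list above can be put together with a few small helpers. This is only a sketch: the endpoint addresses for the Availability and CDX APIs (https://archive.org/wayback/available and https://web.archive.org/cdx/search/cdx) are assumptions taken from IA's public documentation rather than from this page, and all function names are invented for illustration.

```python
from urllib.parse import urlencode

WAYBACK = "https://web.archive.org"


def availability_url(url, timestamp=None):
    """Query URL for the Availability API (assumed endpoint)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


def cdx_url(url, limit=10):
    """Query URL for the CDX API (assumed endpoint), asking for JSON output."""
    query = urlencode({"url": url, "output": "json", "limit": limit})
    return WAYBACK + "/cdx/search/cdx?" + query


def cdx_rows_to_dicts(rows):
    """Turn CDX JSON output (header row first, then captures) into dicts."""
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, capture)) for capture in captures]


def replay_url(timestamp, url, modifier="id_"):
    """Wayback URL for one capture; `modifier` is id_, js_, cs_, im_, if_ or oe_."""
    return f"{WAYBACK}/web/{timestamp}{modifier}/{url}"


def details_to_download_url(details_url):
    """Replace 'details' with 'download' in an item URL, as described above."""
    return details_url.replace("/details/", "/download/", 1)
```

For instance, replay_url("20130806040521", "http://faq.web.archive.org/page-without-wayback-code/") reproduces the id_ example URL given above.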


robots.txt and the Wayback Machine

The Internet Archive used to respect a site's robots.txt file. If that file blocked the ia_archiver user-agent (either directly or with a wildcard rule), the Internet Archive would not crawl the disallowed paths, and it would block access through the Wayback Machine to all previously crawled content matching the disallowed paths until the robots.txt entry was removed. If a site returned a server error when its robots.txt was requested, the IA also interpreted that as a 'Disallow: /' rule. From e-mail correspondence with info@archive.org on Jun 10, 2016 regarding a site returning a 503 HTTP status code for its robots.txt:

The internet Archive respects the privacy of site owners, and therefore, when an error message is returned when trying to retrieve a website’s robots.txt, we consider that as "Disallow: /". -Benjamin

As of April 2017, the Internet Archive is no longer fully respecting robots.txt[2], although this change may not be visible on all archived sites yet. Alexa's crawler still respects robots.txt[3], and Archive-It respects robots.txt by default[4]. Users can still request that their domain be excluded from the Wayback Machine.

Note that if the content is available in the form of a web archive (WARC) file through the IA's normal collections, the WARC file may still be downloaded even if the content is not accessible through the Wayback Machine.
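
The historical (pre-2017) policy described in this section can be approximated with Python's standard robots.txt parser. This is a sketch of the rule as stated here, not IA's actual implementation; the function name was_excluded and its parameters are invented for illustration.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def was_excluded(robots_txt: Optional[str], status: int, url: str,
                 agent: str = "ia_archiver") -> bool:
    """Approximate the old policy for one URL.

    robots_txt: body of the site's robots.txt, or None if none was received.
    status:     HTTP status code returned for the robots.txt request.
    """
    # A server error for robots.txt was interpreted as "Disallow: /".
    if status >= 500:
        return True
    if robots_txt is None:
        return False
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Both direct ia_archiver rules and wildcard (*) rules apply.
    return not parser.can_fetch(agent, url)
```

Under this sketch, a robots.txt containing "User-agent: *" and "Disallow: /" excludes every path on the site, and so does a 503 response for robots.txt, matching the e-mail quoted above.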

Browsing

There are 6 top-level collections in the Archive, under which pretty much everything else falls. These are:

  • web -- Web Crawls
  • texts -- eBooks and Texts
  • movies -- Moving Image Archive
  • audio -- Audio Archive
  • software -- The Internet Archive Software Collection
  • image -- Images

This is an incomplete list of significant sub-collections within the top-level ones:

  • movies -- Moving Image Archive
    • opensource_movies -- Community Video
    • television -- Television
      • adviews -- AdViews
      • tv -- TV News Search & Borrow
    • tvarchive -- Television Archive (where the content in "TV News Search & Borrow" is located; not directly accessible)

Internet Archive/Collections is a list of all the collections that contain other collections.

Backing up the Internet Archive

The contents of the Wayback Machine as of 2002 (and again in 2006) have been duplicated in Alexandria, Egypt, available via https://www.bibalex.org/isis/frontend/archive/archive_web.aspx .

In April 2015, ArchiveTeam founder Jason Scott came up with an idea of a distributed backup of the Internet Archive. In the following months, the necessary tools got developed and volunteers with spare disk space appeared, and now tens of terabytes of rare and precious digital content of the Archive have already been cloned in several copies around the world. The project is open to everyone who has got at least a few hundred gigabytes of disk space that they can sacrifice on the medium or long term. For details, see the INTERNETARCHIVE.BAK page.

Let us clarify once again: ArchiveTeam is not the Internet Archive. This "backing up the Internet Archive" project, like all the other website-rescuing ArchiveTeam projects, is not ordered, requested, organized or supported by the Internet Archive, nor are ArchiveTeam members employees of the Internet Archive (except for a few). Besides accepting – and, in this case, providing – the content, the Internet Archive doesn't collaborate with the ArchiveTeam.

Most of the directly downloadable items at IA are also available as torrents -- at any given time some fraction of these have external seeders, although as of 01:46, 17 February 2016 (EST) there is a problem with IA's trackers where they refuse to track many of the torrents.

Copyright lawsuit

The Internet Archive has faced a lawsuit from publishers over making digital copies of copyrighted works available. In September 2024 they lost at the Second Circuit Court of Appeals. If the level of damages awarded threatens their existence, then we may need to step in at very short notice to rescue their content.

Technical notes

The history of tasks run on each item can be viewed (when logged in) by going to a URL of the form http://archive.org/history/IDENTIFIER (where IDENTIFIER is the id of the item, e.g. the part after "/details/" in a typical IA URL).

Some of the task commands include:

archive.php
Initial uploading, adding of reviews, and other purposes (example)
bup.php
Backing UP items from their primary to their secondary storage location after they are modified (always appears last in any group of tasks) (example)
derive.php
Handles generating the derived data formats (e.g. converting audio files into mp3s, OCRing scanned texts, generating CDX indexes for WARCs) (example)
book_op.php
? Includes virus scan, which usually takes a while. (example)
fixer.php
? (example)
create.php
? (example)
checkin.php
? (example)
delete.php
Used early on (i.e. ~2007) to delete a few items -- not used (except on some test files) since, apparently. (example)
make_dark.php
Removes an item from public view; used for spam, malware, copyright issues, etc. (example)
modify_xml.php
Modify the metadata of an item (?) (example)
make_undark.php
Reverses the effect of make_dark.php (example)

Problems

  • The Wayback Machine and the Internet Archive often suffer from slowdowns and long loading times while browsing, which is particularly noticeable on high-speed internet connections.
  • Some websites cannot be archived by SPN due to the website itself, or to incorrect SPN behavior:
    • Some URLs are blocked by IA from being archived via SPN
    • IRCCloud pastes: the API and URLs return a blank HTTP 400 error to SPN
    • updates.cdn-apple.com: returns a blank HTTP 400 error to SPN
  • URLs are always normalized when they are indexed by the WBM. This means it cannot differentiate between the capitalization, protocol, or www subdomain of the URL. For example, https://web.archive.org/web/20210125024207/http://www.wiki.archiveteam.org/INDEX.pHp and https://web.archive.org/web/20210125024207/https://wiki.archiveteam.org/index.php link to the same capture, even though the latter is what actually was saved.
  • In 2024 IA was targeted with DDOS and account data theft attacks.
  • In 2024 some user accounts got deleted due to an admin error. They cannot be re-registered, and info@archive.org refuses to change the situation. Reviews, forum posts, lists and "My Web Archive" may have been deleted and cannot be restored yet. Uploaded items are still available, but are marked "Uploaded by Unknown". The only option is to register a different account name; if the same email address is used, uploads will be re-associated with the new account, but other data will still be lost.
  • In 2025, the SPN email API appeared to have stopped working, but actually it was severely backlogged, with ~40 day processing times.
  • In 2018, HEAD requests to the SPN API stopped working, but are working again in 2025.
  • Since the hack, site visitors without JavaScript enabled get some URLs redirected to the same URLs with ?noscript=true, which return an HTTP 404 error. info@archive.org did not reply to a report about this. Examples: [1]
  • In 2017, some functions of the WBM added a requirement for JavaScript to be enabled in visitor's browsers.
  • The HTML that the WBM adds to pages is not valid XML, which means it breaks XHTML pages. The fix is easy, info@archive.org forwarded it to engineers at least twice, but it was never fixed. Examples [2] [3].
  • The SPN screenshots option doesn't also enable saving DOM snapshots.
  • The SPN service with outlinks enabled does not have request rate limits, which can overload some sites.
  • WBM often returns "504 Gateway Time-out" when attempting to download very large files. Example: [4]
  • Some pages fail to save properly, with errors such as Job Failed, Save Page Now could not capture this URL because it was unreachable., and SPN internal proxy error. (These issues may be fixed by clicking the "error report" link.)
    • Clicking the error report link may itself produce no response.
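
The URL normalization issue in the list above can be illustrated with a deliberately simplified key function. This is not the Wayback Machine's real canonicalization, only a sketch of the three effects named here (capitalization, protocol and the www subdomain); the name simplified_key is invented for illustration.

```python
from urllib.parse import urlsplit


def simplified_key(url: str) -> str:
    """Collapse case, scheme and a leading 'www.' into one lookup key."""
    parts = urlsplit(url.lower())  # capitalization is ignored
    host = parts.netloc
    if host.startswith("www."):    # the www subdomain is ignored
        host = host[len("www."):]
    key = host + parts.path        # the scheme (http/https) is dropped
    if parts.query:
        key += "?" + parts.query
    return key


first = "http://www.wiki.archiveteam.org/INDEX.pHp"
second = "https://wiki.archiveteam.org/index.php"
same_capture = simplified_key(first) == simplified_key(second)
```

With this sketch, both example URLs from the list map to the same key, which is why they resolve to a single capture.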

See also

External links

Unofficial mobile apps

References