Difference between revisions of "Archive.today"

From Archiveteam
(Very important mention: The robots.txt exclusion – the mortal enemy of Wayback Machine.)
m (→‎Funding: Fix ref link)
(14 intermediate revisions by 7 users not shown)
Line 1: Line 1:
{{Infobox project
{{Infobox project
| title = Archive.is
| title = Archive.today
| description =
| image = Archive-is 2013-07-02 17-05-40.png
| image = Archive-is 2013-07-02 17-05-40.png
| URL =  
| URL = https://archive.today/ and [[#Aliases|others]]
{{url|1=http://archive.is/|2=archive.is}}
* {{url|1=http://archive.vn/|2=archive.vn}}
* {{url|1=http://archive.ec/|2=archive.ec}}
* {{url|1=http://archive.fo/|2=archive.fo}}
* {{url|1=http://archive.is/|2=archive.is}}
* {{url|1=http://archive.today/|2=archive.today}}
* {{url|1=http://archive.md/|2=archive.md}}
* {{url|1=http://archive.ph/|2=archive.ph}}
| project_status = {{online}}
| project_status = {{online}}
| archiving_status = {{nosavedyet}}
| archiving_status = {{nosavedyet}}
}}
}}


Archive.is is a privately funded on-demand archiving site, similar to [[WebCite]]. One key difference is that it stores "Web 2.0" pages better than WebCite; it also supports zip downloads of entire individual webpages and takes a screenshot of the webpage. It is equipped with an URL finder like the [[Wayback Machine]], but additionally text searching feature, powered by Google and also Yandex, which it switches to if Google delivers 0 search results. Unlike on the [[Wayback Machine]], search engines are able to index Archive.is.  It does not store PDFs, binary files, Adobe Flash content, videos, or sounds. The maximum size of a webpage it will archive (including images) is 50MB. Additionally, Archive.is forwards your IP address to the submitted website in a ''X-Forwarded-For'' header.<ref>http://blog.archive.is/post/111779719291/do-you-preserve-archivers-privacy-e-g-not</ref>
'''Archive.today''' is a privately funded on-demand archiving site, similar to [[WebCite]]. One key difference is that it stores "Web 2.0" pages better than WebCite; it also supports zip downloads of entire individual webpages and takes a screenshot of the webpage. It is equipped with an URL finder like the [[Wayback Machine]], but additionally text searching feature, powered by Google and also Yandex, which it switches to if Google delivers 0 search results. Unlike the [[Wayback Machine]], search engines are able to index Archive.today.  It does not store PDFs, binary files, Adobe Flash content, videos, or audio. The maximum size of a webpage it will archive (including images) is 50&nbsp;MB. Additionally, Archive.today forwards your IP address to the submitted website in an ''X-Forwarded-For'' header.<ref>https://blog.archive.today/post/111779719291/do-you-preserve-archivers-privacy-e-g-not</ref>


The main advantage of Archive.is is that it disregards the [[robots.txt]] file that caused many websites and huge amounts of information to become unavailable to the [[Wayback Machine]].
The main advantage of Archive.today is that it disregards the [[robots.txt]] file that caused many websites and huge amounts of information to become unavailable to the [[Wayback Machine]]. Additionally, it allows duplicated snapshots from Wayback Machine and Google Cache (the last of which doesn't store caches indefinitely), searchable by original URLs.


The website shot up significantly in popularity in the second half of 2014 primarily due to the GamerGate controversy. As of Feb. 2015, the website has archived about [http://blog.archive.is/post/111780063961/how-much-storage-is-archive-today-using-currently 200 "Tb" of data.] ''It is likely 200 Terabyte '''TB''', not Terabit '''Tb''' as is quoted. Nonetheless, if accurate, 200Tb ≈ 25TB.''  
The website shot up significantly in popularity in the second half of 2014 primarily due to the GamerGate controversy. As of Feb. 2015, the website has archived about [https://blog.archive.today/post/111780063961/how-much-storage-is-archive-today-using-currently 200 "Tb" of data.] ''It is likely 200 Terabyte ('''TB'''), not Terabit ('''Tb''') as is quoted. Nonetheless, if accurate, 200Tb ≈ 25TB.''  


For additional confusion, "5Tb" is [http://blog.archive.is/post/130682816686/you-mentioned-theres-no-hot-backup-as-of-yet apparently the site's weekly growth].
For additional confusion, "5Tb" is [https://blog.archive.today/post/130682816686/you-mentioned-theres-no-hot-backup-as-of-yet|apparently the site's weekly growth].


On April 14, 2014, Archive.is changed its name to Archive.today due to attacks against [http://www.isnic.is/en/ ISNIC]<ref>http://blog.archive.is/post/82775187091/curious-why-the-move-in-domain-names-from-archive-is</ref><ref>https://twitter.com/archiveis/status/455710701948903424</ref>, and then changed its name back to the original Archive.is some time later.
On April 14, 2014, Archive.is changed its name to Archive.today due to attacks against [https://www.isnic.is/ ISNIC]<ref>{{URL|https://blog.archive.today/post/82775187091/curious-why-the-move-in-domain-names-from-archive-is}}</ref><ref>https://twitter.com/archiveis/status/455710701948903424</ref>, and then changed its name back to the original Archive.is some time later, and then back to Archive.today.


== Vital Signs ==
== Vital Signs ==
Note that the site is a commercial enterprise, and as such can go kaputt at any given point, especially if it does not find a lucrative business model. Although it's not a strong indication of long-term issues; in October 2016 the site {{URL|https://blog.archive.today/post/151979921861/how-are-you-paying-for-the-servers-are-you-just|"made transparent"}} the {{URL|https://blog.archive.today/post/151510917631/how-do-you-guys-keep-the-lights-on-i-gave-the|server costs}}, and started to accept donations. A weekly crowdfunded target of $800<ref>https://liberapay.com/archiveis/donate</ref> is set to maintain the site.


Note that the site is a commercial enterprise, and as such can go kaputt at any given point, especially if it does not find a lucrative business model. Although it's not a strong indication of long-term issues; in October 2016 the site [http://blog.archive.is/post/151979921861/how-are-you-paying-for-the-servers-are-you-just "made transparent"] the [http://blog.archive.is/post/151510917631/how-do-you-guys-keep-the-lights-on-i-gave-the server costs], and started to accept donations. A weekly crowdfunded target of $800<ref>https://liberapay.com/archiveis/donate</ref> is set to maintain the site.
Prior to this, the site actively refused donations. A donation link took the user to an animal shelter donation page<ref>https://web.archive.org/web/20160808113809/https://archive.is/</ref>.
 
Prior to this, the site actively refused donations. A donation link took the user to an animal shelter donation page<ref>http://web.archive.org/web/20160808113809/https://archive.is/</ref>.
 
In January 2017 the administrator commented in response to a censorship query that the site had [http://blog.archive.is/post/155523285411/have-your-servers-really-run-out-space-or-are-you "just run out of CPU for the browsers."] - With problems capturing pages, it is unclear if this is a temporary issue.
 


In January 2017 the administrator commented in response to a censorship query that the site had {{URL|https://blog.archive.today/post/155523285411/have-your-servers-really-run-out-space-or-are-you|"just run out of CPU for the browsers"}}. - With problems capturing pages, it is unclear if this is a temporary issue.


== Funding ==
== Funding ==
 
According to their {{URL|https://archive.today/faq|FAQ}}:
According to their [http://archive.is/faq.html FAQ]:
<blockquote>
<blockquote>
It is privately funded, there in no complex finance behind it. It may look more or less reliable compared to the startup-style funding or an univercity project, depending on which risks are taken into account. My death can cause interruption of service, but something like new market condition or changing head of a department can not.</blockquote>
It is privately funded, there in no complex finance behind it. It may look more or less reliable compared to the startup-style funding or a university project, depending on which risks are taken into account. <s>My death can cause interruption of service, but something like new market condition or changing head of a department can not.</s></blockquote>


As of October 2016 the site has a 'liberapay'<ref>https://liberapay.com/archiveis/donate</ref> donation link at the top-right corner of the page.
As of October 2016 the site has a 'liberapay'<ref>https://liberapay.com/archiveis/donate</ref> donation link at the top-right corner of the page.


Stated in January 2017, through donations the site only receives [http://blog.archive.is/post/154860178511/how-you-make-money "more than $1.50 every day, enough for a bowl of phở".]
Stated in January 2017, through donations the site only receives {{URL|https://blog.archive.today/post/154860178511/how-you-make-money|"more than $1.50 every day, enough for a bowl of phở".}}


==Site structure==
As of March 2021, archived pages have started to show an advert at the top of the screen however, the owner has {{URL|https://blog.archive.today/post/644863407803318272/ads-on-your-website-is-this-a-new-thing-or-just-a|confirmed it}} is a test run and that they will likely not stay.


A list of all domains currently archived is available [http://archive.is/alldomains here].
== Site structure ==
A list of all domains currently archived used to be available [https://archive.today/alldomains here].


[https://archive.org/download/archive.is-alldomains-20140220/archive.is_domains_20140220.txt.7z List of all domains] from [http://archive.is/alldomains archive.is/alldomains] (as of 2014/02/20) = 7,255,826 domains
[https://archive.org/download/archive.is-alldomains-20140220/archive.is_domains_20140220.txt.7z List of all domains] from https://archive.today/alldomains (as of 2014/02/20) = 7,255,826 domains


Sadly, the url counts from /alldomains are out of date.
Sadly, the url counts from /alldomains were out of date.


[https://archive.org/download/archive.is-alldomains-20140220/archive.is_sitemaps_20140217.7z All sitemaps] (as of 2014/02/17)
[https://archive.org/download/archive.is-alldomains-20140220/archive.is_sitemaps_20140217.7z All sitemaps] (as of 2014/02/17)


As a side note, the [http://blog.archive.is/post/117445434661/would-you-consider-handing-over-all-the-captured administrator is unsupportive] of [[Internet Archive]]'s [[robots.txt]] policy - which could hinder future backup cooperation.
As a side note, the [https://blog.archive.today/post/117445434661/would-you-consider-handing-over-all-the-captured administrator is unsupportive] of [[Internet Archive]]'s robots.txt policy - which could hinder future backup cooperation.


== Issues ==
== Issues ==
=== Domain availability ===
=== Domain availability ===
As of 17 Feb, 2016 archive.today domain name is unavailable since 16 Feb, likely due to [http://blog.archive.is/post/138982909006/domain-problems-again "fake DMCA requests"] ([https://web.archive.org/web/20160217044321/http://blog.archive.is/post/138982909006/domain-problems-again copy 1], [https://archive.is/zrsVn copy 2]), [https://twitter.com/archiveis/status/698708729999552512].
As of 17 Feb, 2016 archive.today domain name is unavailable since 16 Feb, likely due to [https://blog.archive.today/post/138982909006/domain-problems-again "fake DMCA requests"], [https://twitter.com/archiveis/status/698708729999552512].
 
As of September 2019, archive.today, .fo etc. resolve to 127.0.0.3 from a few DNS servers (including in Finland), while they continue to work elsewhere, where they resolve to 130.0.234.124, 134.119.220.26 etc. The archive.fo domain was revoked on 2019-10-26.<ref>https://twitter.com/archiveis/status/1188222460598116353</ref>


=== Indefinite loading ===
=== Indefinite loading ===
Sometimes, the page indicates {{URL|http://www.henley-putnam.edu/Portals/_default/Skins/henley/images/loading.gif|loading”}} when trying to access the page, instead of showing the page itself.
Sometimes, the page indicates {{URL|https://www.henley-putnam.edu/Portals/_default/Skins/henley/images/loading.gif|“loading”}} when trying to access the page, instead of showing the page itself.


=== Ditching unsuccessful archivals ===
=== Ditching unsuccessful archivals ===
When the archival of a page has not been successful (e.g. “ Error: time out.”, “ Error: Network error.”), the existing information (network transfer and already downloaded ressources) get discarded and the taget URL of the page archival indicates “Not Found (yet?)”, the same it shows on pages that have never been archived[[YouTube#Reasons_for_video_deletion|, similarly to how YouTube behaves.]]
When the archival of a page has not been successful (e.g. “Error: time out.”, “Error: Network error.”), the existing information (network transfer and already downloaded ressources) get discarded and the target URL of the page archival indicates “Not Found (yet?)”, the same it shows on pages that have never been archived[[YouTube#Reasons_for_video_deletion|, similarly to how YouTube behaves.]]


=== Dismissed information ===
=== Dismissed information ===
Unlike Google Cache, Archive.is does not store the original web page source codes. Also the list of network transfers (shown during archival process) that shows the [[wikipedia:HTTP status|HTTP status]], [[Wikipedia:MIME type|MIME type]], object size (Bytes) and the URL of page elements. File names of saved (embedded) auxiliary page elements get changed into an [[Wikipedia:SHA1|SHA-1-hashsum]] of the file itself, discarding the original file names of images.
Unlike Google Cache, Archive.today does not store the original web page source codes. Also the list of network transfers (shown during archival process) that shows the [[wikipedia:HTTP status|HTTP status]], [[Wikipedia:MIME type|MIME type]], object size (Bytes) and the URL of page elements. File names of saved (embedded) auxiliary page elements get changed into an [[Wikipedia:SHA1|SHA-1-hashsum]] of the file itself, discarding the original file names of images.


Since 2016, the Wayback Machine is unable to access Archive.is due to captcha.
Since 2016, the Wayback Machine is unable to access Archive.today due to captcha.
 
<!--
=== Meta Refresh ===
 
Archive.today uses aggressive meta refreshing. 5 seconds during archival, 1 second during pre-archival queue, of which the latter is exceptionally annoying. A page that redirects or refreshes, for which no Firefox Quantum add-on exists to block it, hijacks the URL bar by changing the contents to the target page, making the URL bar unuseable. The “Esc” key is not as strong as meta refresh. For legacy Firefox, an Add-On for blocking meta refresh/redirects existed.
-->


<!-- Archive.is uses aggressive meta refreshing. 5 seconds during archival, 1 second during pre-archival queue, of which the latter is exceptionally annoying. A page that redirects or refreshes, for which no Firefox Quantum add-on exists to block it, hijacks the URL bar by changing the contents to the target page, making the URL bar unuseable. The “Esc” key is not as strong as meta refresh. For legacy Firefox, an Add-On for blocking meta refresh/redirects existed. -->
=== Quota limits ===
=== Quota limits ===
Each IP address accessing the site apparently only gets an unknown limited amount of access quota. When archiving too many pages, their server eventually stops responding to the IP address for the next few hours.
=== [[YouTube#Comment loading|YouTube comment]] archival ===
Archive.Today used to be able to capture YouTube comments<ref name=CommentArchive>[https://archive.today/d2Cck Sample Archive.Today crawl with YouTube comment loading]</ref> and load more comments automatically to capture more comments than loaded on the initial AJAX load.<br />That only worked when archived directly on the YouTube watch page, e.g. “ https://www.youtube.com/watch?v=0mQW9aWkKl0 ”. When [https://archive.today/dikH8 redirected from YouTu.be], it failed to archive the YouTube comments.
Because [[YouTube#Comment_loading|the way YouTube loads comments]] has been altered over time, since approximately late 2017, Archive.Today's ability to archive YouTube comments has been restricted.
<br />Since then, to archive YouTube comments using Archive.Today, one needs to link directly to a specific comment, which causes comments to be pre-loaded.
* Example linked comment URL: {{URL|1=https://www.youtube.com/watch?v=W3GrSMYbkBE&lc=UgxC238Gea0KGOditl54AaABAg}}
* Archived with linked comment: https://archive.today/OXq7u
* Archived without linked comment: https://archive.today/Uih0b
== Aliases ==
Besides {{url|1=https://archive.today/|2=archive.today}}, the site has been or is available at the following domains:


Each IP address accessing the site apparently only gets an unknown limited amount of access quota. When archiiving too many pages, their server eventually stops responding to the IP address for the next few hours.
* {{url|1=https://archive.is/|2=archive.is}}
* {{url|1=https://archive.li/|2=archive.li}}
* {{url|1=https://archive.vn/|2=archive.vn}}
* {{url|1=https://archive.fo/|2=archive.fo}}
* {{url|1=https://archive.md/|2=archive.md}}
* {{url|1=https://archive.ph/|2=archive.ph}}
* <s>{{url|1=http://archive.ec/|2=archive.ec}}</s> (As far as known, the ''Archive.ec'' domain was only used in 2016.<ref>{{url|2=http://archive.ec/, former (2016) domain of Archive.today, did not block self-archival.|1=https://archive.today/Yf5jR}}</ref>)


== Archives ==
== Archives ==

Revision as of 17:34, 9 March 2021

Archive.today
Archive-is 2013-07-02 17-05-40.png
URL https://archive.today/ and others
Status Online!
Archiving status Not saved yet
Archiving type Unknown
IRC channel #archiveteam-bs (on hackint)

Archive.today is a privately funded on-demand archiving site, similar to WebCite. One key difference is that it stores "Web 2.0" pages better than WebCite; it also supports zip downloads of entire individual webpages and takes a screenshot of each webpage. Like the Wayback Machine, it has a URL finder, and it additionally offers a text search feature powered by Google, switching to Yandex if Google delivers 0 search results. Unlike the Wayback Machine, Archive.today can be indexed by search engines. It does not store PDFs, binary files, Adobe Flash content, videos, or audio. The maximum size of a webpage it will archive (including images) is 50 MB. Additionally, Archive.today forwards your IP address to the submitted website in an X-Forwarded-For header.[1]

The main advantage of Archive.today is that it disregards the robots.txt file, which has caused many websites and huge amounts of information to become unavailable in the Wayback Machine. Additionally, it allows snapshots to be duplicated from the Wayback Machine and Google Cache (the latter of which doesn't store caches indefinitely), searchable by their original URLs.

The website's popularity rose significantly in the second half of 2014, primarily due to the GamerGate controversy. As of Feb. 2015, the website had archived about 200 "Tb" of data. The figure likely means 200 terabytes (TB), not terabits (Tb) as quoted. If the quoted unit is nonetheless accurate, 200 Tb ≈ 25 TB.

Adding to the confusion, "5Tb" is apparently the site's weekly growth.
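The unit ambiguity above comes down to a factor of eight. A minimal sketch of the conversion, assuming decimal prefixes and 8 bits per byte; the function name is made up for illustration and the inputs are simply the figures quoted above:

```python
# Convert terabits (Tb) to terabytes (TB), assuming 1 byte = 8 bits.
BITS_PER_BYTE = 8


def terabits_to_terabytes(terabits: float) -> float:
    """Return the number of terabytes equal to the given number of terabits."""
    return terabits / BITS_PER_BYTE


# Figures quoted on the site's blog, read literally as terabits.
total_tb = terabits_to_terabytes(200)
weekly_tb = terabits_to_terabytes(5)
print(f"200 Tb = {total_tb} TB, 5 Tb = {weekly_tb} TB")
```

If the blog actually meant terabytes, no conversion applies and the figures stand as 200 TB and 5 TB.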

On April 14, 2014, Archive.is changed its name to Archive.today due to attacks against ISNIC.[2][3] Some time later it changed its name back to the original Archive.is, and then back again to Archive.today.

Vital Signs

Note that the site is a commercial enterprise, and as such can go kaputt at any given point, especially if it does not find a lucrative business model. Although it is not a strong indication of long-term issues, in October 2016 the site "made transparent" the server costs and started to accept donations. A weekly crowdfunding target of $800[4] is set to maintain the site.

Prior to this, the site actively refused donations. A donation link took the user to an animal shelter donation page[5].

In January 2017, the administrator commented in response to a censorship query that the site had "just run out of CPU for the browsers". Given the problems capturing pages, it is unclear whether this is a temporary issue.

Funding

According to their FAQ:

It is privately funded, there in no complex finance behind it. It may look more or less reliable compared to the startup-style funding or a university project, depending on which risks are taken into account. My death can cause interruption of service, but something like new market condition or changing head of a department can not.

As of October 2016 the site has a 'liberapay'[6] donation link at the top-right corner of the page.

As stated in January 2017, the site receives only "more than $1.50 every day, enough for a bowl of phở" through donations.

As of March 2021, archived pages have started to show an advert at the top of the screen; however, the owner has confirmed that it is a test run and that the adverts will likely not stay.

Site structure

A list of all domains currently archived used to be available here.

List of all domains from https://archive.today/alldomains (as of 2014/02/20) = 7,255,826 domains

Sadly, the URL counts from /alldomains were out of date.

All sitemaps (as of 2014/02/17)

As a side note, the administrator is unsupportive of Internet Archive's robots.txt policy, which could hinder future backup cooperation.

Issues

Domain availability

As of 17 Feb 2016, the archive.today domain name has been unavailable since 16 Feb, likely due to "fake DMCA requests" [1].

As of September 2019, archive.today, .fo etc. resolve to 127.0.0.3 from a few DNS servers (including in Finland), while they continue to work elsewhere, where they resolve to 130.0.234.124, 134.119.220.26 etc. The archive.fo domain was revoked on 2019-10-26.[7]
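One way to tell the two kinds of DNS answer apart is to test whether a returned address falls in the loopback range. The following is only a sketch: the helper names are made up for illustration, the addresses are the ones quoted above, and real answers depend on which resolver is queried.

```python
import ipaddress
import socket


def looks_blocked(address: str) -> bool:
    """Return True if a DNS answer points into the loopback range 127.0.0.0/8."""
    return ipaddress.ip_address(address).is_loopback


def resolve_ipv4(hostname: str) -> list[str]:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Classify the addresses mentioned above without touching the network.
for addr in ("127.0.0.3", "130.0.234.124", "134.119.220.26"):
    status = "loopback (blocked?)" if looks_blocked(addr) else "routable"
    print(addr, status)
```

Calling `resolve_ipv4("archive.today")` and passing each result to `looks_blocked` would apply the same check to whatever the local resolver currently returns.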

Indefinite loading

Sometimes, the site shows a “loading” indicator when one tries to access a page, instead of showing the page itself.

Ditching unsuccessful archivals

When the archival of a page has not been successful (e.g. “Error: time out.”, “Error: Network error.”), the existing information (network transfers and already downloaded resources) is discarded and the target URL of the page archival indicates “Not Found (yet?)”, the same message it shows for pages that have never been archived, similarly to how YouTube behaves.

Dismissed information

Unlike Google Cache, Archive.today does not store the original web page source code. Also discarded is the list of network transfers (shown during the archival process), which shows the HTTP status, MIME type, object size (bytes) and URL of each page element. File names of saved (embedded) auxiliary page elements are replaced with the SHA-1 hash sum of the file itself, discarding the original file names of images.
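Content-based naming of this kind can be illustrated with a short sketch. The exact naming scheme Archive.today uses (for example, whether a file extension is kept) is not documented here, so the function below is a hypothetical illustration that returns only the hexadecimal SHA-1 digest of the file's bytes.

```python
import hashlib


def hashed_name(content: bytes) -> str:
    """Return the SHA-1 hex digest of the content, usable as a content-based file name."""
    return hashlib.sha1(content).hexdigest()


# Two files with identical bytes map to the same name, whatever they were called originally.
logo_a = b"abc"  # stand-in for the bytes of "logo.png"
logo_b = b"abc"  # stand-in for the bytes of "header-image.png"
print(hashed_name(logo_a))
print(hashed_name(logo_a) == hashed_name(logo_b))
```

This shows why the original file names cannot be recovered afterwards: the name is derived from the content alone.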

Since 2016, the Wayback Machine has been unable to access Archive.today due to a captcha.


Quota limits

Each IP address accessing the site apparently gets only a limited access quota of unknown size. When too many pages are archived, the server eventually stops responding to the IP address for the next few hours.
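Because the quota is unknown, a client that submits many pages can only guess at a safe pace. The sketch below spaces requests by a fixed minimum interval; the `Pacer` class and the 60-second interval are assumptions chosen for illustration, not values published by the site.

```python
import time


class Pacer:
    """Enforce a minimum delay between consecutive actions (hypothetical helper)."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval_s
        self._clock = clock    # injectable so the pacing logic can be tested
        self._sleep = sleep
        self._last = None      # time of the previous action, None before the first

    def wait(self):
        """Block until at least min_interval_s has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Assumed interval: one submission per minute. The real limit is not known.
pacer = Pacer(min_interval_s=60)
```

Calling `pacer.wait()` before each submission keeps the request rate bounded; if the server still stops responding, the interval would have to be raised by trial and error.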

YouTube comment archival

Archive.Today used to be able to capture YouTube comments[8] and to load further comments automatically, capturing more than were loaded on the initial AJAX load.
That only worked when a page was archived directly from the YouTube watch page, e.g. “https://www.youtube.com/watch?v=0mQW9aWkKl0”. When redirected from YouTu.be, it failed to archive the YouTube comments.

Because the way YouTube loads comments has been altered over time, since approximately late 2017, Archive.Today's ability to archive YouTube comments has been restricted.
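Since then, to archive YouTube comments using Archive.Today, one needs to link directly to a specific comment, which causes comments to be pre-loaded.
* Example linked comment URL: https://www.youtube.com/watch?v=W3GrSMYbkBE&lc=UgxC238Gea0KGOditl54AaABAg
* Archived with linked comment: https://archive.today/OXq7u
* Archived without linked comment: https://archive.today/Uih0b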
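The URL shapes involved in this workaround can be sketched as follows. The `lc` query parameter is taken from the linked-comment example in this article's wikitext; the helper names are made up for illustration, and nothing here contacts YouTube or Archive.today.

```python
from urllib.parse import urlencode, urlparse


def watch_url(video_id, comment_id=None):
    """Build a full YouTube watch URL, optionally linking to a specific comment via `lc`."""
    params = {"v": video_id}
    if comment_id:
        params["lc"] = comment_id
    return "https://www.youtube.com/watch?" + urlencode(params)


def expand_short_url(url):
    """Rewrite a youtu.be short link to the full watch-page form; leave other URLs unchanged."""
    parsed = urlparse(url)
    if parsed.netloc.lower() == "youtu.be":
        return watch_url(parsed.path.lstrip("/"))
    return url


print(watch_url("W3GrSMYbkBE", "UgxC238Gea0KGOditl54AaABAg"))
print(expand_short_url("https://youtu.be/0mQW9aWkKl0"))
```

Submitting the expanded watch-page URL rather than the short link avoids the redirect case described above, and adding a comment ID produces the direct-comment form.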

Aliases

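Besides archive.today, the site has been or is available at the following domains:
* archive.is
* archive.li
* archive.vn
* archive.fo
* archive.md
* archive.ph
* archive.ec (as far as is known, this domain was only used in 2016)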

Archives

/alldomains Archive

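References

1. https://blog.archive.today/post/111779719291/do-you-preserve-archivers-privacy-e-g-not
2. https://blog.archive.today/post/82775187091/curious-why-the-move-in-domain-names-from-archive-is
3. https://twitter.com/archiveis/status/455710701948903424
4. https://liberapay.com/archiveis/donate
5. https://web.archive.org/web/20160808113809/https://archive.is/
6. https://liberapay.com/archiveis/donate
7. https://twitter.com/archiveis/status/1188222460598116353
8. Sample Archive.Today crawl with YouTube comment loading: https://archive.today/d2Cck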