Archive.today

From Archiveteam
Revision as of 20:41, 22 August 2024 by Censuro (talk | contribs)
URL: https://archive.today/ and others
Status: Online!
Archiving status: Not saved yet
Archiving type: Unknown
IRC channel: #archiveteam-bs (on hackint)

Archive.today is a privately funded on-demand archiving site, similar to WebCite. It gained traction as an alternative to the Wayback Machine, particularly for webpages whose JavaScript fails to replay there, or for domains that have been or may become excluded or censored from the Wayback Machine.

On some popular news sites and magazines, archive.today is able to save a copy of an article despite paywalls (preservation that is arguably piracy).[1] There seems to be an ethos of archiving the content even when that conflicts with replaying web pages completely authentically, unlike the Wayback Machine's "view from nowhere". This is underscored by the fact that archive.today appears to avoid saving the ads that come with pages,[2] and gets to the content behind annoyances or login walls (using dedicated accounts) on popular social media sites such as Twitter[3], GitHub[4], and Reddit[5].

Search engines are able to index Archive.today.

The website shot up significantly in popularity in the second half of 2014, primarily due to the GamerGate controversy. As of February 2015, the website has archived about 200 "Tb" of data. This most likely means 200 terabytes (TB), not terabits (Tb) as quoted; read literally, 200 Tb ≈ 25 TB. Adding to the confusion, the site's weekly growth is given as "5 Tb".
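The terabit/terabyte arithmetic above can be checked with a minimal sketch. The function name is made up for illustration, and the only assumption is the usual 8 bits per byte; the decimal "tera" prefix cancels out.

```python
BITS_PER_BYTE = 8


def terabits_to_terabytes(terabits: float) -> float:
    """Convert terabits (Tb) to terabytes (TB); the 'tera' prefix cancels out."""
    return terabits / BITS_PER_BYTE


# Figures quoted by the site, read literally as terabits.
print(f"200 Tb = {terabits_to_terabytes(200)} TB")  # total archive size
print(f"5 Tb = {terabits_to_terabytes(5)} TB")      # weekly growth
```

If the quoted figures are really terabytes, no conversion is needed and the archive is eight times larger than the literal reading suggests.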

Archive.is, archive.ph, archive.md and a few other alternative domains are aliases used by archive.today with the purpose of circumventing blocks by some ISPs.[6][7]

Vital Signs

Note that the site is a commercial enterprise and as such can go kaputt at any given point, especially if it does not find a lucrative business model. Although this is not a strong indication of long-term issues, in October 2016 the site "made transparent" the server costs and started to accept donations. A weekly crowdfunding target of $800[8] is set to maintain the site.

Prior to this, the site actively refused donations. A donation link took the user to an animal shelter donation page[9].

In January 2017 the administrator commented, in response to a censorship query, that the site had "just run out of CPU for the browsers". With pages failing to capture, it is unclear whether this is a temporary issue.

Funding

According to their FAQ:

It is privately funded, there in no complex finance behind it. It may look more or less reliable compared to the startup-style funding or a university project, depending on which risks are taken into account. My death can cause interruption of service, but something like new market condition or changing head of a department can not.

As of October 2016 the site has a 'liberapay'[10] donation link at the top-right corner of the page.

According to a statement from January 2017, the site only receives "more than $1.50 every day, enough for a bowl of phở" through donations.

As of March 2021, archived pages have started to show an advert at the top of the screen; however, the owner has confirmed that it is a test run and that the adverts will likely not stay.

Content type

Archive.today takes snapshots of webpages with a JavaScript-capable headless browser. The maximum size of a webpage it will archive (including images) is 50 MB. A screenshot image of the webpage is also taken. Unlike the Wayback Machine, it does not store arbitrary file types: it won't save PDFs, binary files, Adobe Flash content, videos, or audio.

Archive.today represents captured pages as a static snapshot, rendered by the Archive.today server, and uses a fixed-width layout. Page resources such as JavaScript and CSS files are not retained separately. That is, styling from a separate CSS file is converted to inline CSS styling, embedded in the HTML source code. For more details on functionality, see the Wikipedia page.

Snapshots from Wayback Machine and Google Cache are searchable by the original URLs.

Archive.today URLs

In the wild, the top-level domain may be any of those listed under Aliases (.today|.is|.ph|…). Archived pages are accessed through their short URL format, an identifier of five case-sensitive alphanumeric characters (four characters on early captures from 2012). A long, or canonical, URL format can be obtained by clicking "share" in the top menu or by appending "/share" to the URL, but it is not widely used.

Short url
http://archive.today/<XXXXX>
Long url
http://archive.today/<date>/<original url>
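The two URL formats can be sketched as a few helper functions. This is a minimal illustration, not the site's own logic: the function names are hypothetical, the identifier pattern (four or five characters from A-Z, a-z, 0-9) is inferred from the description above, and the date is passed through as an opaque string because its exact format is not specified here.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Assumption: a short identifier is 4 or 5 characters from [A-Za-z0-9],
# as described above; the real site may accept other forms.
SHORT_ID = re.compile(r"[A-Za-z0-9]{4,5}")


def short_url(snapshot_id: str, domain: str = "archive.today") -> str:
    """Build a short URL; `domain` may be any of the site's aliases."""
    if not SHORT_ID.fullmatch(snapshot_id):
        raise ValueError(f"not a 4- or 5-character alphanumeric id: {snapshot_id!r}")
    return f"http://{domain}/{snapshot_id}"


def long_url(date: str, original_url: str, domain: str = "archive.today") -> str:
    """Build a long URL; `date` is not validated because its format is unspecified."""
    return f"http://{domain}/{date}/{original_url}"


def extract_short_id(url: str) -> Optional[str]:
    """Return the identifier if the URL path looks like a short id, else None.

    This check is purely syntactic: any 4- or 5-character alphanumeric path matches.
    """
    path = urlsplit(url).path.strip("/")
    return path if SHORT_ID.fullmatch(path) else None


# "AbC12" is a made-up identifier used only for illustration.
print(short_url("AbC12", domain="archive.ph"))
print(extract_short_id("http://archive.today/AbC12"))
```

Because identifiers are case-sensitive, the helpers never change the case of the identifier.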

Site structure

A list of all domains currently archived used to be available here.

List of all domains from https://archive.today/alldomains (as of 2014/02/20) = 7,255,826 domains

Sadly, the URL counts from /alldomains were out of date.

All sitemaps (as of 2014/02/17)

As a side note, the administrator is unsupportive of the Internet Archive's robots.txt policy, which could hinder future backup cooperation.

Issues

Domain availability

As of 17 February 2016, the archive.today domain name has been unavailable since 16 February, likely due to "fake DMCA requests".[1]

As of September 2019, archive.today, .fo etc. resolve to 127.0.0.3 from a few DNS servers (including in Finland), while they continue to work elsewhere, where they resolve to 130.0.234.124, 134.119.220.26 etc. The archive.fo domain was revoked on 2019-10-26.[11]
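A resolver answering with a loopback address such as 127.0.0.3 can be told apart from a real answer with a short check. This is only a sketch: the function name is hypothetical, and treating a loopback answer as a sign of resolver-level blocking is an assumption, not something the site or the DNS operators have confirmed. The addresses are the ones quoted above; no network lookup is performed.

```python
import ipaddress


def looks_blocked(resolved_address: str) -> bool:
    """Return True if a DNS answer points at the loopback range (127.0.0.0/8 or ::1).

    Assumption: a loopback answer for a public site indicates a blocking or
    misconfigured resolver rather than the site's real server.
    """
    return ipaddress.ip_address(resolved_address).is_loopback


# Addresses reported above for archive.today and its aliases.
for address in ("127.0.0.3", "130.0.234.124", "134.119.220.26"):
    status = "loopback (possibly blocked)" if looks_blocked(address) else "routable answer"
    print(f"{address}: {status}")
```

In practice the address would come from a lookup against a specific DNS server, so that answers from different resolvers can be compared.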

Indefinite loading

Sometimes, a snapshot shows only “loading” when it is accessed, instead of the archived page itself.

Ditching unsuccessful archivals

When the archival of a page has not been successful (e.g. “Error: time out.”, “Error: Network error.”), the existing information (network transfers and already downloaded resources) is discarded, and the target URL of the page archival indicates “Not Found (yet?)”, the same message it shows for pages that have never been archived, similar to how YouTube behaves.

Dismissed information

Unlike Google Cache, Archive.today does not store the original source code of web pages. Also discarded is the list of network transfers (shown during the archival process), which gives the HTTP status, MIME type, object size (bytes) and URL of each page element. File names of saved (embedded) auxiliary page elements are changed into the SHA-1 hash sum of the file itself, discarding the original file names of images.

Since 2016, the Wayback Machine has been unable to access Archive.today because of a captcha.
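Naming a file after the SHA-1 hash of its content can be sketched as follows. This illustrates the general technique only: the function name is hypothetical, and whether the site keeps a file extension or adds a path prefix is not known here.

```python
import hashlib


def hashed_name(content: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of the content, used as the stored name.

    Assumption: only the digest is used; any extension handling is unknown.
    """
    return hashlib.sha1(content).hexdigest()


# Two images with different original names but identical bytes map to the same
# stored name, which is why the original file names cannot be recovered.
logo_bytes = b"example image bytes"
print(hashed_name(logo_bytes))  # the name is the same for "logo.png" or "banner.png"
```

A side effect of content-based names is deduplication: identical resources embedded in many pages only need to be stored once.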


Quota limits

Each IP address accessing the site apparently gets a limited access quota of unknown size. When too many pages are archived, the server eventually stops responding to that IP address for the next few hours.

Constant reCAPTCHAs

People using a VPN, a proxy, or a mobile device report having to solve reCAPTCHAs every time they use the site. Once the captcha is completed, it grants 5 minutes of access before another captcha is requested. Previously, the captcha could not be solved on mobile devices because the reCAPTCHA was clipped to half the page, but this has since been fixed.

YouTube comment archival

Archive.today used to be able to capture YouTube comments[12] and to load more comments automatically, capturing more than those loaded by the initial AJAX request.
That only worked when archiving the YouTube watch page directly, e.g. “ https://www.youtube.com/watch?v=0mQW9aWkKl0 ”. When redirected from youtu.be, it failed to archive the YouTube comments.

Because the way YouTube loads comments has changed over time, Archive.today's ability to archive YouTube comments has been restricted since approximately late 2017.
Since then, to archive YouTube comments using Archive.today, one needs to link directly to a specific comment, which causes comments to be pre-loaded.

Aliases

Besides archive.today, the site has been or is available at the following domains:

Other features

The site also lets users select a portion of a page and embed that selection into the URL, so that a specific portion can be shared.[14] This works by using a JavaScript handler to map the selector in the URL to a specific portion of the page. That seems to be the only part of the archived pages that needs JavaScript; other than that, the site is completely accessible without it (provided you get past the captcha).

Archives

/alldomains Archive

References