Obstacles
When archiving things online, we may encounter obstacles. These come in many forms:
Anti-bot software
We may encounter anti-bot software such as Anubis, go-away, yap, haphash, Cloudflare, Akamai, Fastly, reCAPTCHA, hCaptcha, Vercel Security Checkpoint, Incapsula, Deflect, ALTCHA, BasedFlare (analysis), sethrawall, botcheck, BunkerWeb, PerimeterX/HUMAN, Sucuri, and DataDome (aka captcha-delivery.com), although there are many more of these kinds of software.
These programs are designed to block automated clients, and since archiving manually is very painful, tools like ArchiveBot get caught in the crossfire and need monitoring to detect when they are being blocked.
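A crawl can be spot-checked for blockage with a small script that looks for error codes and common challenge-page markers in responses. This is only a minimal sketch, assuming the Python requests library; the marker strings are illustrative examples, not an exhaustive or authoritative list:

```python
# Minimal sketch: probe a URL and guess whether an anti-bot challenge was served.
# The marker strings below are illustrative examples, not an exhaustive list.
import requests

CHALLENGE_MARKERS = [
    "just a moment",          # Cloudflare interstitial title
    "anubis",                 # Anubis proof-of-work page
    "verify you are human",
    "challenge",
]

def looks_blocked(url: str) -> bool:
    resp = requests.get(url, timeout=30, headers={"User-Agent": "ArchiveTeam probe"})
    if resp.status_code in (403, 429, 503):
        return True
    body = resp.text.lower()
    return any(marker in body for marker in CHALLENGE_MARKERS)

if __name__ == "__main__":
    print(looks_blocked("https://example.com/"))
```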
Poisoners
Nepenthes, Iocaine, Spigot and other systems tarpit web crawlers by serving them endless streams of junk data in order to poison training data for LLMs. These are usually easy to spot in ArchiveBot URL logs, since they produce large amounts of nonsense URLs; see the sketch below.
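One rough way to spot a tarpit in a URL log is to flag a burst of long, high-entropy paths that never repeat. A minimal sketch; the log filename, log format (one URL per line) and thresholds are assumptions, not how ArchiveBot itself detects this:

```python
# Rough heuristic: flag URL paths that look like machine-generated junk
# (long and high-entropy). The length and entropy thresholds are guesses.
import math
from collections import Counter
from urllib.parse import urlparse

def shannon_entropy(s: str) -> float:
    counts = Counter(s)
    total = len(s)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def looks_like_tarpit(url: str) -> bool:
    path = urlparse(url).path
    return len(path) > 80 and shannon_entropy(path) > 4.0

with open("urls.log") as f:          # assumed: one URL per line
    flagged = [line.strip() for line in f if looks_like_tarpit(line.strip())]
print(f"{len(flagged)} suspicious URLs")
```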
WAF rules
WAFs sometimes have rules that try to block SQL injection attempts, but legitimate URLs on some sites (for example, wiki pages whose titles contain SQL-like strings) can match those rules, so the WAF blocks them and prevents archival.
Rate limits
These could be applied based on average traffic, by IP address or IP range, or by other methods.
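A crawler can cope with such limits by slowing down when it sees rate-limit responses and honoring the Retry-After header. A minimal sketch, assuming the Python requests library; the delay values are arbitrary and would be tuned per site:

```python
# Minimal sketch of polite fetching with backoff on rate-limit responses.
import time
import requests

def polite_get(url: str, base_delay: float = 1.0, max_retries: int = 5):
    delay = base_delay
    for _ in range(max_retries):
        resp = requests.get(url, timeout=30)
        if resp.status_code not in (429, 503):
            return resp
        # Retry-After may be a number of seconds (handled here) or an HTTP date (not handled).
        retry_after = resp.headers.get("Retry-After", "")
        time.sleep(float(retry_after) if retry_after.isdigit() else delay)
        delay *= 2   # exponential backoff when no Retry-After is given
    return resp
```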
Slow servers
Often the servers hosting a site have limited capacity, which effectively acts as a global rate limit. Sometimes they have autoscaling, but triggering it requires careful tuning of the crawl rate. In Cohost's case, one particular type of page (per-user subdomains) was dramatically slowing down the site, and we only found this out in the last days of the project.
Geoblocking
Due to GeoIP-based blocking, some sites cannot be accessed from certain parts of the world, or can only be accessed from certain parts of the world. AT currently has no in-use solution to work around this, but manually setting up new ArchiveBot pipelines in other locations, or #Y, may work. There are several services that check reachability worldwide by sending HTTP and other requests (see also the sketch after this list):
- https://globalping.io/
- https://check-host.net/
- https://ping-admin.com/
- https://ping.pe/
(or maybe #Y at some point).
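Alongside those services, a quick check of what the site returns from your own vantage point is useful for comparison against results from other locations. A minimal sketch, assuming the Python requests library:

```python
# Minimal sketch: report the status code or connection error for each URL given,
# intended to be run from different vantage points and compared.
import sys
import requests

def check(url: str) -> str:
    try:
        resp = requests.get(url, timeout=15, allow_redirects=True)
        return f"{url} -> HTTP {resp.status_code}"
    except requests.RequestException as exc:
        return f"{url} -> error: {exc.__class__.__name__}"

if __name__ == "__main__":
    for url in sys.argv[1:]:
        print(check(url))
```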
JavaScript
Many websites now require JavaScript to get any data. ArchiveBot can't run JavaScript, but it does try hard to fetch resources linked from it. Mnbot/SPN/archive.today and other services can archive single pages that require JavaScript. Warrior projects can "get" JavaScript, but only through a large amount of manual work, effectively rewriting part of the site's JavaScript in Lua.
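When a page is only a shell that loads its content from a JSON endpoint, one common workaround is to find that endpoint in the browser dev tools' network monitor and archive it directly. A minimal, hypothetical sketch (the endpoint URL and response fields are invented for illustration and are not any real site's API):

```python
# Hypothetical sketch: the page at /post/123 renders nothing without JavaScript,
# but the data it displays comes from a JSON endpoint found via dev tools.
# The URL and response fields below are made up for illustration.
import requests

api_url = "https://example.com/api/posts/123"   # hypothetical endpoint
data = requests.get(api_url, timeout=30).json()
print(data.get("title"), data.get("body"))
```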
Playback
Uses of relatively rare JavaScript APIs and tricks can often break the Wayback Machine's playback. Especially annoying are scripts that blank the whole page when any error occurs, which means that the page can be viewed just fine, but only with JavaScript off. Also problematic are random nonces or client-side datetime parameters in URLs.
WebSockets
The WARC format has no way to record WebSocket traffic. Any data that goes through it requires custom capture, custom storage, and custom playback. Particularly memorable in this regard was a site called Peerlyst, which had effectively reinvented HTTP over a WebSocket inside of an SPA.[1]
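Custom capture in practice means connecting with a WebSocket client and storing the raw frames somewhere outside the WARC, for example as JSON lines. A minimal sketch, assuming the third-party Python websockets library; the endpoint is hypothetical, and a real capture would also need to speak the site's own protocol (auth, subscriptions) to receive anything useful:

```python
# Minimal sketch of custom WebSocket capture: record received messages as JSON lines.
# The endpoint is hypothetical; binary frames are stored as hex for simplicity.
import asyncio
import json
import time
import websockets

async def capture(url: str, outfile: str, duration: float = 60.0):
    async with websockets.connect(url) as ws:
        deadline = time.monotonic() + duration
        with open(outfile, "a") as f:
            while time.monotonic() < deadline:
                try:
                    msg = await asyncio.wait_for(ws.recv(), timeout=5)
                except asyncio.TimeoutError:
                    continue
                record = {"t": time.time(),
                          "data": msg if isinstance(msg, str) else msg.hex()}
                f.write(json.dumps(record) + "\n")

asyncio.run(capture("wss://example.com/socket", "capture.jsonl"))
```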
Fingerprinting
Some sites use TLS fingerprinting to allow known browser TLS implementations but block other web clients. ArchiveBot pipelines with a different OpenSSL config will have a different TLS fingerprint. The "Copy as cURL" option in the browser dev tools' network monitor can be used to detect this: if a request that works in the browser is blocked when replayed with identical headers through cURL, the block is likely based on the TLS fingerprint rather than the request headers. Mnbot/SPN/archive.today may be able to archive affected pages.
Other sites may use multi-protocol passive fingerprinting (TCP/TLS/HTTP) to decide what to block; Huginn Net is an example of an open-source tool that can do this.
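When TLS fingerprinting is the culprit, one workaround used by some custom grab scripts is a client that mimics a browser's TLS handshake. A minimal sketch, assuming the third-party Python package curl_cffi is installed; this is not part of ArchiveBot:

```python
# Minimal sketch: fetch a page with a browser-like TLS fingerprint using curl_cffi.
# curl_cffi is a third-party package; "chrome" is just a commonly available
# impersonation target and the set of targets changes between releases.
from curl_cffi import requests

resp = requests.get("https://example.com/", impersonate="chrome", timeout=30)
print(resp.status_code, len(resp.text))
```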
Encryption choices
Some sites only support SSL or old versions of TLS, or deprecated ciphers, all of which are considered obsolete on the modern web and are disabled by default in current web encryption implementations like OpenSSL. Some of the ArchiveBot pipelines (pokepipe, nullpipe, tantalus, bonkpipe) are currently configured to accept some older encryption, but the others are not, and the config might not allow all old web encryption. Future versions of web encryption implementations may make it even harder or impossible to accept older encryption.
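For custom grabs, a client can sometimes be told to accept older TLS versions and weaker ciphers, if the local OpenSSL build still supports them. A minimal sketch using Python's standard ssl module; whether the connection actually succeeds depends on how the local OpenSSL was built, and this is not ArchiveBot's configuration:

```python
# Minimal sketch: an SSL context that accepts TLS 1.0+ and lower-security ciphers.
# Some distributions compile out old protocol versions entirely, so this may fail.
import socket
import ssl

ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1      # allow TLS 1.0 and up
ctx.set_ciphers("DEFAULT:@SECLEVEL=1")          # permit weaker ciphers/keys
ctx.check_hostname = False                      # old sites often have bad certs too
ctx.verify_mode = ssl.CERT_NONE

with socket.create_connection(("example.com", 443), timeout=15) as sock:
    with ctx.wrap_socket(sock, server_hostname="example.com") as tls:
        print("negotiated:", tls.version())
```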
Encoding choices
Some archiving tools only support certain character sets. For example, ArchiveBot doesn't extract links from HTML pages encoded in UTF-16 with a BOM.
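A pre-processing step can detect the BOM and transcode the page to UTF-8 before link extraction. A minimal sketch in Python; the filename is a placeholder:

```python
# Minimal sketch: detect a UTF-16 BOM and decode accordingly before parsing HTML.
import codecs

def decode_html(raw: bytes) -> str:
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode("utf-16")        # decoding strips the BOM itself
    return raw.decode("utf-8", errors="replace")

with open("page.html", "rb") as f:         # placeholder filename
    html = decode_html(f.read())
```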
Protocol usage
HTTP POST requests usually aren't supported by archiving tools, and playback of POST responses in the WBM might not work.
Protocol choices
The WARC format standard doesn't currently support HTTP/2, HTTP/3, or other protocols, so sites that require them can't be viewed in the WBM normally and require various workarounds to archive - refer to SmolNet for some examples. Common Crawl, CDP-based tools, and some other crawlers can make HTTP/2 requests, but always record them as HTTP/1.1.[2]
Protocol compliance
Some sites return 404 or other error codes instead of 200 for URLs that do exist. Others return 200 instead of 404 even for pages that don't exist ("soft 404s"). Usually the page content indicates whether the page really exists.
Protocol compliance issues can prevent ArchiveBot from saving pages, or lead to too many pages being archived that then have to be manually ignored.
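One common way to handle soft 404s in custom scripts is to fetch a URL that is guaranteed not to exist and compare its content against the pages being checked. A minimal sketch; the base URL, random path, and similarity threshold are arbitrary illustrative choices:

```python
# Minimal sketch for detecting "soft 404" pages: fetch a deliberately bogus URL
# and treat pages that look very similar to it as non-existent.
import difflib
import uuid
import requests

BASE = "https://example.com"                       # placeholder site
bogus = requests.get(f"{BASE}/{uuid.uuid4().hex}", timeout=30).text

def really_exists(path: str, threshold: float = 0.9) -> bool:
    body = requests.get(f"{BASE}{path}", timeout=30).text
    similarity = difflib.SequenceMatcher(None, bogus, body).ratio()
    return similarity < threshold                  # very similar => probably a soft 404

print(really_exists("/about"))
```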
Big ID spaces
In particular, on sites where not every page can be expected to be linked from somewhere - file hosts, for instance - the only option left for discovery is often to try every possible ID. But for some sites the space has trillions or more of possible IDs, especially when there has been a deliberate decision to make it too big to search, as happened with Google Drive and Roblox.
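The scale problem is easy to see with a back-of-the-envelope calculation: even at a very aggressive request rate, an unguessable ID space cannot be enumerated. A small worked example with made-up numbers:

```python
# Back-of-the-envelope: how long would brute-forcing an ID space take?
# The numbers are illustrative, not measurements of any real site.
id_space = 62 ** 12            # e.g. 12 characters of [A-Za-z0-9]: about 3.2e21 IDs
requests_per_second = 10_000   # wildly optimistic sustained crawl rate
seconds = id_space / requests_per_second
years = seconds / (60 * 60 * 24 * 365)
print(f"{id_space:.2e} IDs -> about {years:.1e} years")   # on the order of 1e10 years
```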