Anubis

From Archiveteam

Anubis is a "self hostable scraper defense software"[1] that is increasingly being adopted, especially by open source projects struggling under the load that aggressive AI/LLM scrapers place on their infrastructure.

Since the bots and crawlers we employ fit that description too, they can also be detected and blocked by Anubis.

How it works

Anubis is a man-in-the-middle HTTP proxy that requires clients to either solve or have solved a proof-of-work challenge before they can access the site. This is a very simple way to block the most common AI scrapers because they are not able to execute JavaScript to solve the challenge. The scrapers that can execute JavaScript usually don't support the modern JavaScript features that Anubis requires. In case a scraper is dedicated enough to solve the challenge, Anubis lets them through because at that point they are functionally a browser.

The most hilarious part about how Anubis is implemented is that it triggers challenges for every request with a User-Agent containing "Mozilla". Nearly all AI scrapers (and browsers) use a User-Agent string that includes "Mozilla" in it. This means that Anubis is able to block nearly all AI scrapers without any configuration. —Xe[2]
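The two ideas in the quote above, a User-Agent check followed by a proof-of-work challenge, can be sketched roughly as follows. This is a minimal illustration, not Anubis's actual code. The function names, the challenge string, and the "leading zero hex digits of a SHA-256 hash" difficulty rule are assumptions chosen for the example.

```python
import hashlib
from itertools import count


def should_challenge(user_agent: str) -> bool:
    """Hypothetical rule: challenge any client whose User-Agent contains "Mozilla"."""
    return "Mozilla" in user_agent


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Check that SHA-256(challenge + nonce) starts with `difficulty` zero hex digits."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def solve(challenge: str, difficulty: int) -> int:
    """Brute-force the smallest nonce that satisfies verify() (assumed PoW scheme)."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce


if __name__ == "__main__":
    print(should_challenge("Mozilla/5.0 (X11; Linux x86_64)"))  # browser-like UA
    print(should_challenge("curl/8.5.0"))                       # non-Mozilla UA
    nonce = solve("example-challenge", 3)
    print(nonce, verify("example-challenge", nonce, 3))
```

Under this assumed scheme, each additional zero hex digit multiplies the expected number of hash attempts by 16. A client that cannot run the solver at all never gets past the challenge page.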

Problems caused by Anubis

As the default user agent for Archivebot includes the word "Mozilla"[3], it triggers the Anubis challenge as well and ultimately fails because it cannot solve it. After a friendly discussion in the #archiveteam-bs channel[4], a temporary solution was found: making Archivebot use the 'curl' user agent, which does not trigger the challenge. However, the curl user agent can trigger other errors on outlinks, and some sites do not allowlist non-Mozilla UAs and thus cannot be archived.
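The user agent workaround amounts to sending a User-Agent header that does not contain "Mozilla". The sketch below shows the idea with Python's standard library only. It is not Archivebot's actual configuration. The helper name `build_request` and the exact value `curl/8.5.0` are assumptions for illustration.

```python
import urllib.request

# Assumed example value; any curl-style string without "Mozilla" illustrates the point.
CURL_UA = "curl/8.5.0"


def build_request(url: str, user_agent: str = CURL_UA) -> urllib.request.Request:
    """Build a request whose User-Agent header does not contain "Mozilla"."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_request("https://example.org/")
    # urllib normalizes header names, so the key is stored as "User-agent".
    print(req.get_header("User-agent"))
```

Passing the result to `urllib.request.urlopen` would perform the fetch. As noted above, a site that rejects non-Mozilla user agents outright will still refuse such a request.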

Wikibot should still be affected by this too, but it can set an arbitrary User-Agent header, so the block is easy to work around.

mnbot is blocked with the default User-Agent, but the stealth User-Agent is not blocked.

IA SPN is not blocked by Anubis.

Modern browsers that have Mozilla in the User-Agent but have cookies or JavaScript disabled (including the Tor Browser on Security Level: Safest) or, for Anubis 1.20.0, have <meta refresh> disabled cannot complete the Anubis PoW, so the anubis-bypass WebExtension (code) is needed. Cookie-blocking extensions may interfere with anubis-bypass's detection of Anubis. In that case you can either edit the code to add the hostnames, or temporarily enable cookies on the affected hostnames once. After that, anubis-bypass detection will work, it will store the hostname in its browser localStorage settings for future use, and cookies can remain disabled. For sites with high difficulty, anubis_offload may be useful to offload the PoW computation to a GPU or native code on the CPU.

Projects and websites known to deploy Anubis

See: Anubis/uncategorized

Resources

References