Dealing with Cloudflare
Warning
None of these methods are Wayback-grade. They involve editing /etc/hosts or pointing your tools at a different address.
Use them only if you are desperate for the actual data - this includes WikiTeam tasks. For ArchiveBot, see Obstacles#TLS fingerprinting.
Verified and fresh as of early November 2025. AI companies are attacking, the arms race is progressing, and everything may change with zero notice.
Scenario 1 - Attack Mode
The site doesn't work in any client and throws a CAPTCHA everywhere.
So, you'll need to bypass Cloudflare. There are guides all over the Internet, but here is a summary so that you don't have to find your way through AI slop.
- Ready-made tools like CloudFail - patch out DNSDumpster or it won't work.
diff --git a/DNSDumpsterAPI.py b/DNSDumpsterAPI.py
index b5fb7bc..2683803 100644
--- a/DNSDumpsterAPI.py
+++ b/DNSDumpsterAPI.py
@@ -65,6 +65,15 @@ class DNSDumpsterAPI(object):
     def search(self, domain):
+        res = {}
+        res['domain'] = domain
+        res['dns_records'] = {}
+        res['dns_records']['dns'] = {}
+        res['dns_records']['mx'] = {}
+        res['dns_records']['txt'] = {}
+        res['dns_records']['host'] = {}
+
+        return res
         dnsdumpster_url = 'https://dnsdumpster.com/'
         req = self.session.get(dnsdumpster_url)
- Force the server to reach out to you.
- MediaWiki, some forums, shops, etc. will by default try to send you an email, usually directly. Then, look at Received: headers.
- On WordPress systems, you can abuse pingbacks: guide
- Some forums (in particular Discourse) and Mastodon-like systems (but not every Fediverse server) will generate link previews.
- Fediverse servers can be provoked to talk ActivityPub to you, but this is harder.
- I've even seen one running effectively an open proxy. Limited image proxies are more common.
- You might find something else. Use your own creative thinking.
- Old IPs at DNS History and similar services.
- Subdomains: Finding subdomains
- Shodan or Censys might have seen a server identifying itself with the domain in question.
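For the email trick in the list above, pulling candidate addresses out of the Received: headers can be scripted. The following is a minimal sketch, not a tested tool: the function name candidate_ips is hypothetical, it only handles IPv4, and it assumes you saved the raw message (headers included) as text.

```python
import ipaddress
import re
from email import message_from_string

# Loose IPv4 pattern; ipaddress does the real validation below.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def candidate_ips(raw_email: str) -> list[str]:
    """Return public IPv4 addresses seen in Received: headers, in order."""
    msg = message_from_string(raw_email)
    found: list[str] = []
    for header in msg.get_all("Received", []):
        for text in IPV4_RE.findall(header):
            try:
                ip = ipaddress.ip_address(text)
            except ValueError:
                continue  # e.g. 999.1.1.1 or a version number
            # Skip loopback, RFC 1918 and other non-routable addresses.
            if ip.is_global and text not in found:
                found.append(text)
    return found
```

The headers are listed newest hop first, so the address of the site's own mail server is usually near the bottom. It may belong to a third-party mail provider rather than the web server, so treat every result as a candidate only.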
When you get a candidate IP, curl https://$TARGET/ --connect-to ::$IP and see if you have a good homepage.
It's possible the origin is properly firewalled. You can check this by poking it from Cloudflare Workers (cf. Scenario 2) and comparing the results with what you get directly (timeout or connection refused vs. 403 or a failed TLS handshake). I haven't found a way to actually exploit this, but at least it's a good sign to stop digging.
If it's not firewalled, # echo "$IP $TARGET" >> /etc/hosts and go wild!
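The candidate check described above (connect to the IP, but present the target's hostname) can also be scripted when you have many IPs to try. This is a rough sketch using only the standard library; probe_origin and its return strings are hypothetical names chosen for illustration, and it only reads the status line rather than judging whether the homepage is "good".

```python
import socket
import ssl


def probe_origin(target: str, ip: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to `ip`, but use `target` for SNI and the Host header."""
    context = ssl.create_default_context()
    try:
        raw = socket.create_connection((ip, port), timeout=timeout)
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        return f"connect error: {exc}"
    try:
        # server_hostname sets SNI and the name the certificate is checked against.
        with context.wrap_socket(raw, server_hostname=target) as tls:
            request = f"GET / HTTP/1.1\r\nHost: {target}\r\nConnection: close\r\n\r\n"
            tls.sendall(request.encode("ascii"))
            status_line = tls.makefile("rb").readline().decode("latin-1").strip()
            return status_line or "empty response"
    except ssl.SSLError as exc:
        return f"TLS failed: {exc.reason}"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"I/O error: {exc}"
    finally:
        raw.close()
```

A result such as "HTTP/1.1 200 OK" is worth a closer look with curl. Certificate verification is left on, so an origin with a self-signed or Cloudflare Origin CA certificate shows up as a TLS failure here even though it may be the right server.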
Scenario 2 - TLS fingerprinting
Works seamlessly in a normal browser and curl-impersonate, sometimes works in ArchiveBot, but doesn't work in WikiBot, WikiTeam, curl, wget etc. That indicates TLS fingerprinting or whatever fancy technology was invented in the meantime.
If you can solve it the way described in Scenario 1, do so. But if you can't, read on.
Enter Cloudflare Workers - your code, running at Cloudflare's servers!
You can't cram a full WikiTeam or even CGIProxy in here, but Workers are useful nonetheless.
Justauser personally uses this for testing and this (with empty replace_dict) for WikiTeam dumps.
Make an account (throwaway emails OK), start with Hello World, paste one of the above over it, tweak the settings, deploy, and point whatever tool you wanted to run at the URL you get.
Whatever the worker uses for its HTTP requests is, for obvious reasons, accepted by the target's firewall (an IP whitelist, perhaps? I'm not sure, but it works), and the worker itself is on your account, so you get to define its access control.
An additional benefit: you can modify requests and responses, and this is transparent to the tooling.
For a real example, the WAF at proteopedia.org throws a 403 if the URL contains a bracket. For the few pages whose titles have one, I export the XML manually (the browser puts the page name in the POST body, but WikiTeam doesn't; alternatively, you can edit a page by revision ID) and tell the worker to return the canned response:
// Return a canned export for the one page whose URL the WAF rejects.
if (request.url.toString().indexOf('3-oxoacyl-%28acyl-carrier-protein%29_synthase&action=submit') > 0) {
  let r = new Response(atob('PG1lZG...raT4'));
  return r;
}
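The two things such a worker does, mapping the URL your tool requested onto the real site and short-circuiting selected requests with a canned body, can be sketched outside the Workers runtime. The Python below only models that logic and is not the linked worker code; TARGET_HOST, CANNED, upstream_url and canned_response are all hypothetical names, and the hostnames and page title are placeholders.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

TARGET_HOST = "wiki.example.org"  # hypothetical site behind Cloudflare

# Hypothetical canned bodies, keyed by a substring of the request URL.
CANNED = {
    "Some_%28bracketed%29_title&action=submit": b"<mediawiki>...</mediawiki>",
}


def upstream_url(worker_url: str, target_host: str = TARGET_HOST) -> str:
    """Keep path and query, swap the worker's host for the target's."""
    parts = urlsplit(worker_url)
    return urlunsplit(("https", target_host, parts.path, parts.query, ""))


def canned_response(worker_url: str) -> Optional[bytes]:
    """Return a stored body if the URL matches a known problem page."""
    for needle, body in CANNED.items():
        if needle in worker_url:
            return body
    return None
```

In a real worker the canned check would run first, and only requests without a match would be forwarded to the URL that upstream_url computes.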
What doesn't work
- WARP: seems to use its own IP space, not whitelisted.
- curl-impersonate, Req and surf: don't integrate with any existing tools.
- Joining any Archive Team IRC channel that has eggdrop and typing Buttflare--: helps with the sadness and rage, but not with archival.
Untested
- hrequests
- It looks like you can sometimes solve the CAPTCHA in a browser and keep using the cookie. Allegedly the User-Agent, the IP and the cf_clearance cookie all have to match; more specific information is welcome.
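If you want to try the cookie approach, the request would need to carry the same User-Agent and cf_clearance value as the browser that solved the CAPTCHA, and come from the same IP. The sketch below only builds such a request with the standard library; request_with_clearance is a hypothetical helper, and like the item above it is untested against Cloudflare (a fingerprinting check may still reject it).

```python
import urllib.request


def request_with_clearance(url: str, cf_clearance: str, user_agent: str) -> urllib.request.Request:
    """Build a request that reuses a browser's cf_clearance cookie and User-Agent."""
    req = urllib.request.Request(url)
    # Both values must be copied from the browser session that solved the CAPTCHA.
    req.add_header("User-Agent", user_agent)
    req.add_header("Cookie", f"cf_clearance={cf_clearance}")
    return req
```

Passing the result to urllib.request.urlopen sends it from the machine running the script, so that machine has to share the browser's public IP.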
Other notes
- Some sites that look like they use Cloudflare actually don't, for example if they use https://github.com/donlon/cloudflare-error-page or one of its forks. Archive.today is one of them.