Scraping Sites That Block Bots: Cloudflare, DataDome & PerimeterX
Tony Wang · 11 min read
Why scrapers get blocked by Cloudflare, DataDome and PerimeterX — and how to get through reliably with stealth browsers, IP rotation and clearance reuse.
For most of the web, scraping is a solved problem: fetch the URL, parse the HTML, done. The interesting sites — the ones with prices, listings, reviews, and inventory worth collecting — are exactly the ones that don't let you. They sit behind Cloudflare, DataDome, PerimeterX (now HUMAN), Akamai, or Kasada, and the moment a script asks for a page it gets a CAPTCHA, a "checking your browser" interstitial, or a flat 403. The hard part of modern scraping isn't parsing the page. It's getting the page.
This guide explains how that wall actually works — the signals these systems check and why a normal HTTP client trips every one of them — and then how a scraper gets through reliably without pretending the problem is simpler than it is.
The four layers of bot detection
Anti-bot vendors don't rely on a single check. They score every request across several independent layers, and a request has to look human on all of them. Understanding the layers is the whole game, because each one rules out a different class of naive scraper.
| Layer | What it checks | Why a normal scraper fails |
|---|---|---|
| IP reputation | Is the address residential/mobile (trusted) or datacenter (suspect)? Has it made too many requests? | Scrapers run on cloud servers and datacenter proxies — ranges these vendors flag on sight |
| TLS / HTTP fingerprint | Does the TLS handshake and HTTP/2 frame order match a real browser (JA3/JA4, Akamai fingerprint)? | requests, curl, and most HTTP libraries have a fingerprint nothing like Chrome's, no matter the headers |
| JavaScript sensor | A script probes for navigator.webdriver, headless tells, missing APIs, canvas/WebGL/font quirks | Headless automation leaks dozens of signals; an HTTP client runs no JavaScript at all, so the sensor never reports back |
| Behaviour | Request cadence, mouse/scroll, navigation patterns, cookie continuity | Scripts hit pages faster and more mechanically than a person, from a session with no history |
The reason "just set a realistic User-Agent" stopped working years ago is that the User-Agent is a single self-reported string, the easiest signal to fake, and none of the four layers takes it on trust. You can claim to be Chrome 140 all you like; if your TLS handshake says Python and your IP says AWS, you've already failed two checks before the page even loads.
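As a toy model of the table above, the gate can be pictured as a conjunction of the four layers. This is only a sketch: real vendor scoring is proprietary and weighted rather than boolean, and every name here (`RequestSignals`, `failed_layers`, the field names) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestSignals:
    """Hypothetical, simplified view of what each detection layer sees."""
    ip_is_datacenter: bool
    tls_matches_browser: bool
    ran_js_sensor: bool
    cadence_looks_human: bool


def failed_layers(sig: RequestSignals) -> list[str]:
    """Return the layers this request fails; an empty list means it passes."""
    checks = {
        "ip_reputation": not sig.ip_is_datacenter,
        "tls_http_fingerprint": sig.tls_matches_browser,
        "javascript_sensor": sig.ran_js_sensor,
        "behaviour": sig.cadence_looks_human,
    }
    return [layer for layer, ok in checks.items() if not ok]


# A Python HTTP client on a cloud server with a spoofed Chrome User-Agent.
# Changing the header flips none of these fields, so every layer still fails.
spoofed_ua_script = RequestSignals(
    ip_is_datacenter=True,
    tls_matches_browser=False,
    ran_js_sensor=False,
    cadence_looks_human=False,
)
print(failed_layers(spoofed_ua_script))
```

The point of the model is the return type: a request is judged by the list of layers it fails, not by any single header.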
What the vendors actually do
The big three behave differently enough that it's worth knowing which wall you're looking at.
- Cloudflare issues a managed challenge and, on success, a cf_clearance cookie that's bound to your IP and User-Agent. Pass the challenge and the cookie buys you a window of access; change IP and it's void.
- DataDome scores the request in real time and sets a datadome cookie; it's aggressive about datacenter IPs and replays of a fingerprint across too many requests.
- PerimeterX / HUMAN runs a JavaScript sensor that POSTs a signal payload and grants a _px3-family clearance cookie. It's strongly IP-bound: a cookie minted on one IP, presented from another, reads as theft and gets blocked harder than no cookie at all.
The common thread: a clearance — the cookie that says "this client passed" — is only earned by executing the challenge in a real browser, and it's tied to the IP that earned it. That single fact dictates the entire strategy.
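One way to make that binding concrete is to store each clearance under the IP that earned it, so it can never be presented from a different exit. The sketch below is a minimal in-memory version under that assumption; `Clearance`, `ClearanceCache`, and the expiry handling are illustrative names and choices, not any vendor's or library's API, and the real cookie lifetime is whatever the site sets.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clearance:
    cookie_name: str       # e.g. "cf_clearance"
    cookie_value: str
    earned_from_ip: str
    user_agent: str
    expires_at: float      # Unix timestamp; the real lifetime is set by the site


class ClearanceCache:
    """Caches at most one clearance per (domain, exit IP) pair."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], Clearance] = {}

    def put(self, domain: str, clearance: Clearance) -> None:
        self._store[(domain, clearance.earned_from_ip)] = clearance

    def get(self, domain: str, ip: str, user_agent: str,
            now: Optional[float] = None) -> Optional[Clearance]:
        """Return a clearance only if it was earned from this IP and User-Agent."""
        now = time.time() if now is None else now
        clearance = self._store.get((domain, ip))
        if clearance is None:
            return None
        if now >= clearance.expires_at:
            del self._store[(domain, ip)]  # expired: drop it
            return None
        if clearance.user_agent != user_agent:
            return None  # presenting it under another User-Agent would not validate
        return clearance


cache = ClearanceCache()
cache.put("example.com", Clearance("cf_clearance", "abc123", "203.0.113.7",
                                   "Mozilla/5.0 (hypothetical UA)", expires_at=1_000.0))
```

Keying on the IP means a lookup from a rotated exit simply misses, which is the safe outcome: the scraper earns a new clearance instead of replaying one that would read as stolen.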
Which wall, which fix
Each vendor leaves a fingerprint of its own — a tell-tale cookie or block message — and yields to a different lever. This is the rough field guide:
| Wall | How you know it's there | What gets through | Difficulty |
|---|---|---|---|
| Cloudflare | cf_clearance / __cf_bm cookies, a "Checking your browser…" interstitial | A real browser that solves the managed challenge; then reuse cf_clearance | Low–medium |
| DataDome | datadome cookie, a 403 with a DataDome CAPTCHA page | Trusted IP + real fingerprint; rotate IPs — it punishes replay | Medium |
| PerimeterX / HUMAN | _px* cookies, "Access to this page has been denied" | A browser that runs the JS sensor; clearance is IP-bound, so race fresh IPs | Medium–high |
| Akamai Bot Manager | _abck / bm_sz / ak_bmsc cookies | Real TLS fingerprint + browser; the _abck cookie must validate or you're shadow-blocked | High |
| Kasada | x-kpsdk-* headers, a kpsdk sensor script | Full browser execution of the sensor; among the hardest to pass headlessly | High |
The pattern repeats: every one of them ultimately wants to see a real browser, from an IP it trusts, behaving like a person. The walls differ mostly in how strict each of those three checks is.
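The "how you know it's there" column translates fairly directly into a lookup. The sketch below checks only the markers listed in the table; the function name `detect_walls` and its signature are assumptions for illustration, and a site can sit behind more than one vendor, so it returns a list.

```python
from typing import Iterable


def detect_walls(cookie_names: Iterable[str],
                 header_names: Iterable[str],
                 body: str = "") -> list[str]:
    """Guess which anti-bot vendors are present from cookies, headers, and body text."""
    cookies = {c.lower() for c in cookie_names}
    headers = {h.lower() for h in header_names}
    text = body.lower()

    found = []
    if cookies & {"cf_clearance", "__cf_bm"} or "checking your browser" in text:
        found.append("Cloudflare")
    if "datadome" in cookies:
        found.append("DataDome")
    if (any(c.startswith("_px") for c in cookies)
            or "access to this page has been denied" in text):
        found.append("PerimeterX / HUMAN")
    if cookies & {"_abck", "bm_sz", "ak_bmsc"}:
        found.append("Akamai Bot Manager")
    if any(h.startswith("x-kpsdk-") for h in headers) or "kpsdk" in text:
        found.append("Kasada")
    return found


print(detect_walls({"_px3", "_pxvid"}, set()))
```

These are heuristics, not proof: a vendor can rename a cookie or change a block message at any time, so an empty result only means none of the known markers were seen.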
How to get through reliably
The mistake most scrapers make is doing one thing — a stealth browser, or a residential proxy — for every request. That's slow and expensive on easy pages and still fragile on hard ones. The reliable pattern is escalation: start cheap, and climb only as far as a specific site forces you to.
- Chrome-impersonated HTTP. A plain request, but with a TLS and HTTP/2 fingerprint that matches real Chrome. This alone clears the fingerprint layer and is enough for a surprising number of "protected" sites — at a fraction of the cost of a browser.
- A real stealth browser. When the page needs JavaScript executed — a Cloudflare or PerimeterX challenge — hand it to a fleet of hardened browser engines with patched automation tells and genuine fingerprints. Different engines beat different vendors, so racing or rotating across a fleet matters more than betting on one.
- A fresh IP. Anti-bot is IP-reputation-first, so when the current exit is flagged, the highest-yield move is simply a different address. Firing several requests through a rotating pool at once — and taking the first that comes back with the real page — turns a coin-flip into a near-certainty, because one fresh IP usually passes while others are blocked.
Not all addresses are equal, though, and this is where cost enters. IPs come in tiers of reputation, and the more trusted the tier, the more it costs:
- Datacenter IPs are cheap and plentiful, but the most flagged — whole ranges are known to belong to clouds and hosting providers, so vendors distrust them by default.
- Residential IPs are real home-broadband addresses sourced through proxy networks. They look like ordinary visitors, carry far more trust, and cost meaningfully more.
- Mobile IPs are carrier-NAT'd 4G/5G addresses — the most trusted of all, because thousands of real phones share one address, so blocking it risks blocking real customers. They're also the most expensive, usually billed by the gigabyte of traffic rather than per address.
The reflex on a hard wall is to reach straight for the priciest IPs. The cheaper play is the one this whole section is about: lean on the stack — escalate to a real browser, race several datacenter IPs at once, and reuse a clearance once you earn it — so you get through on low-cost addresses as often as possible, and only fall back to residential or mobile for the targets that genuinely demand them. You pay for reputation exactly when the page forces you to, and not a request sooner.
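The escalation ladder described above can be sketched as a loop over tiers ordered by cost. Everything here is illustrative: `Tier`, `fetch_with_escalation`, and the cost numbers are hypothetical, and the stub fetchers stand in for a Chrome-impersonating HTTP client, a stealth browser on a datacenter exit, and a stealth browser on a residential exit.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tier:
    name: str
    relative_cost: int                      # illustrative units, not real prices
    fetch: Callable[[str], Optional[str]]   # returns page HTML, or None if blocked


def fetch_with_escalation(url: str, tiers: list[Tier]) -> tuple[Optional[str], list[str]]:
    """Try tiers from cheapest to most expensive; stop at the first real page."""
    tried = []
    for tier in sorted(tiers, key=lambda t: t.relative_cost):
        tried.append(tier.name)
        html = tier.fetch(url)
        if html is not None:
            return html, tried
    return None, tried


# Stub fetchers: the cheap tier is blocked, the browser tiers get through.
def blocked(url: str) -> Optional[str]:
    return None


def passes(url: str) -> Optional[str]:
    return "<html>real page</html>"


tiers = [
    Tier("stealth_browser_residential_ip", 20, passes),
    Tier("impersonated_http", 1, blocked),
    Tier("stealth_browser_datacenter_ip", 5, passes),
]
html, tried = fetch_with_escalation("https://example.com/product/123", tiers)
print(tried)
```

The residential tier is never touched in this run, which is the cost argument in miniature: the expensive rung exists, but a request only reaches it when every cheaper one has failed.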
Two refinements make this fast as well as reliable:
- Reuse the clearance. When a browser earns a cf_clearance or DataDome cookie, cache it per-domain for its short lifetime and attach it to later requests from the same IP (and, for Cloudflare, the same User-Agent). A cheap engine can then ride a clearance an expensive one paid for: higher success, lower cost.
- Only count a real page as success. A challenge page or a "checking your browser" shell can come back as a 200 with bytes in the body, and even a 403 block page has a body that parses as HTML. A scraper that treats any of those as success hands you garbage and learns nothing. Detecting the difference, and escalating instead of returning the shell, is what separates a number that looks good from data you can use.
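A minimal success check, assuming only the block messages mentioned in this article, might look like the following. The marker list, the `min_length` threshold, and the function name are all assumptions; a production check would usually add site-specific tests, such as requiring a known element of the real page.

```python
# Block-page phrases quoted earlier in this article; extend per target site.
CHALLENGE_MARKERS = (
    "checking your browser",
    "access to this page has been denied",
)


def looks_like_real_page(status: int, body: str, min_length: int = 500) -> bool:
    """Heuristic: True only for a 200 with a substantial, non-challenge body."""
    if status != 200:
        return False
    text = body.lower()
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return False
    # A near-empty shell is treated as a failure; the threshold is arbitrary.
    return len(body) >= min_length


article = "<article>" + "word " * 200 + "</article>"
print(looks_like_real_page(200, article))
print(looks_like_real_page(200, "<html>Checking your browser...</html>"))
```

Whatever the exact rules, the design choice is that a failed check triggers the next rung of escalation rather than being returned to the caller as a result.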
A worked example: a PerimeterX-protected page
Here's how the pieces compose on a real, stubborn target — a news site behind PerimeterX, where a datacenter exit IP had been used enough that the wall had already flagged it.
- One engine, one flagged IP: asking a single stealth browser to fetch the page from that burned IP succeeded only about 1 in 6 times. The engine was capable; the IP was the bottleneck.
- Race fresh IPs: firing four requests for the same page concurrently, each through a different exit IP, and taking the first that returned the real article, lifted success to 5 in 6 — and it was faster, because the winner usually came back in a few seconds while the blocked attempts were abandoned.
- Reuse the clearance: once one browser earned PerimeterX's clearance cookie, even a lightweight engine that never beats the wall on its own rode that cached cookie straight to the full page. The expensive request paid for the clearance; the cheap ones cashed it in.
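The racing step can be sketched with `asyncio`: start one attempt per exit IP, take the first that returns a real page, and cancel the rest. The function `race_exits` and the `fetch(url, exit_ip)` callable are hypothetical; `fake_fetch` below is a stand-in for a real browser fetch through a proxy, with one exit arbitrarily treated as unflagged.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def race_exits(
    url: str,
    exits: list[str],
    fetch: Callable[[str, str], Awaitable[Optional[str]]],
) -> Optional[str]:
    """Fetch url through every exit at once; return the first real page, else None."""
    tasks = [asyncio.create_task(fetch(url, exit_ip)) for exit_ip in exits]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                html = await finished
            except Exception:
                continue  # one failed attempt should not sink the race
            if html is not None:
                return html
        return None
    finally:
        for task in tasks:
            task.cancel()  # abandon attempts that are still in flight


async def fake_fetch(url: str, exit_ip: str) -> Optional[str]:
    # Stand-in for a real fetch: only one exit is treated as unflagged.
    unflagged = exit_ip == "203.0.113.3"
    await asyncio.sleep(0.01 if unflagged else 0.05)
    return "<html>real article</html>" if unflagged else None


exits = ["203.0.113.1", "203.0.113.2", "203.0.113.3", "203.0.113.4"]
result = asyncio.run(race_exits("https://example.com/story", exits, fake_fetch))
print(result)
```

As a rough sanity check on the numbers above: if each fresh exit passed independently with probability p, n raced attempts would pass with probability 1 - (1 - p)^n, and 5 in 6 with four exits would correspond to p of roughly 0.36. Exits drawn from the same pool are rarely independent, so treat that as a back-of-envelope reading rather than a measurement.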
No single one of these is a silver bullet. The result comes from stacking them — escalate to a browser, race fresh IPs, reuse what works — and from refusing to count a challenge page as a win. That last point is easy to get wrong: the naive version returns a 200 full of nothing and reports a great success rate.
Build it yourself, or buy it?
You can assemble all of this yourself, and for a single target it's a reasonable weekend project: a stealth-patched browser, a proxy, a cookie cache. The cost shows up later, as a treadmill.
Building it means owning a fleet of patched browser engines (the patches go stale every Chrome release), a residential or mobile proxy budget (the line item that actually sets your ceiling), the escalation logic that picks the cheapest method per site, success detection that isn't fooled by challenge pages, and a standing commitment to re-fix all of it every time a vendor ships a detection update — which is their full-time job and, for you, a side-quest that keeps interrupting the real one.
Buying it — a managed scraping API — trades that treadmill for a per-request price. The honest comparison isn't "API credits versus free code"; it's "API credits versus a proxy bill plus the engineering weeks you'll spend maintaining detection bypasses instead of shipping your product." For one or two easy sites, DIY wins. For a moving list of protected targets you need to keep reliable, the maintenance is the product — and that's the part worth outsourcing.
Where Crawlora fits
This is exactly the model Crawlora's Web Scraping API is built on. A single /web/scrape call escalates on its own — Chrome-impersonated HTTP, then a fleet of stealth browser engines, then fresh IPs raced concurrently — captures and reuses clearance cookies per domain, and only returns a page once it's confirmed to be real content rather than a challenge. You get clean Markdown (or HTML, links, and metadata) back, and you're billed for what succeeds.
A request is one call — ask for the formats you want and let it escalate:
```bash
curl -X POST "https://api.crawlora.net/api/v1/web/scrape" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/123",
    "formats": ["markdown", "links", "metadata"],
    "render": "auto"
  }'
```
render: "auto" is the escalation switch: it starts with Chrome-impersonated HTTP and climbs to the stealth-browser fleet (and fresh IPs) only if the page demands it, so you don't pay browser prices for pages that don't need a browser. The same call from Python:
```python
import requests

resp = requests.post(
    "https://api.crawlora.net/api/v1/web/scrape",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com/product/123",
        "formats": ["markdown", "links", "metadata"],
        "render": "auto",  # escalate HTTP -> stealth browser -> fresh IP as needed
    },
    timeout=60,
)
resp.raise_for_status()  # fail loudly on an HTTP error instead of a KeyError below

data = resp.json()["data"]
print(data["markdown"])           # clean article text, not raw HTML
print(data["metadata"]["title"])  # parsed page metadata
```
If you just want to know what you're up against before you write a line of code, run a URL through the free can-I-scrape-this-site checker: it reports which protection a site uses and how hard it'll be to collect.
A closing note on scope: everything here is for public pages. Bot detection is not a login, and getting past one isn't the same as breaking into private data, but the responsible line is still public data only, with each site's terms and robots directives respected, and personal or copyrighted content left alone. For the legal landscape, see "Is web scraping legal in 2026"; for the related question of gated articles, see "How paywalls actually work".
Scrape the sites that block everyone else
One API call — auto-escalating stealth browsers, rotating IPs, and clearance reuse — that returns clean Markdown and bills only for what succeeds. 2,000 free credits a month, no card.
Frequently asked questions
How do websites detect and block scrapers?
Modern anti-bot systems check four layers at once: the IP's reputation (datacenter ranges are flagged, residential and mobile are trusted), the TLS and HTTP/2 fingerprint (a real Chrome handshake looks different from a Python or curl one), a JavaScript sensor that probes the browser for automation tells (headless flags, missing APIs, canvas/WebGL quirks), and behaviour over time. Failing any layer earns a CAPTCHA or a block page, so a scraper that only spoofs the User-Agent gets stopped immediately.
Why does my scraper get blocked by Cloudflare or DataDome?
Almost always the IP and the fingerprint. Requests from a cloud server (AWS, GCP, a datacenter proxy) sit in ranges these vendors treat as suspicious, and a non-browser HTTP client has a TLS/JS fingerprint that doesn't match a real Chrome. Cloudflare, DataDome and PerimeterX combine those signals — so the fix isn't a better User-Agent string, it's a real browser fingerprint coming from a trusted IP.
Can you scrape a site protected by Cloudflare, DataDome or PerimeterX?
Often, yes, for public pages — but it's probabilistic, not guaranteed. The reliable approach escalates only as far as a site demands: a Chrome-impersonated HTTP request first, then a real stealth browser that executes the challenge, then a fresh IP if the current one is flagged. Once a browser earns a clearance cookie, cheaper requests can reuse it. The hardest, IP-strict sites need residential or mobile egress to stay reliable.
Is it legal to scrape sites that block bots?
Scraping publicly accessible pages is broadly defensible, and a bot-detection wall is not itself an access control on private data the way a login is. But the law turns on what you collect and how you use it — respect each site's terms and robots directives, avoid personal data and copyrighted content at scale, and never use this to get past a login or a paywall. This is not legal advice.