Tony Wang · June 11, 2026 · 11 min read

Scraping Sites That Block Bots: Cloudflare, DataDome & PerimeterX

Why scrapers get blocked by Cloudflare, DataDome and PerimeterX — and how to get through reliably with stealth browsers, IP rotation and clearance reuse.

For most of the web, scraping is a solved problem: fetch the URL, parse the HTML, done. The interesting sites — the ones with prices, listings, reviews, and inventory worth collecting — are exactly the ones that don't let you. They sit behind Cloudflare, DataDome, PerimeterX (now HUMAN), Akamai, or Kasada, and the moment a script asks for a page it gets a CAPTCHA, a "checking your browser" interstitial, or a flat 403. The hard part of modern scraping isn't parsing the page. It's getting the page.

This guide explains how that wall actually works — the signals these systems check and why a normal HTTP client trips every one of them — and then how a scraper gets through reliably without pretending the problem is simpler than it is.

Key takeaways

Bot detection is layered: IP reputation, the TLS/HTTP fingerprint, a JavaScript sensor that probes the browser, and behaviour over time. You have to pass all of them, not one.
A datacenter IP and a non-browser fingerprint are what get most scrapers blocked — not a missing User-Agent. Spoofing headers alone does almost nothing.
The reliable pattern is escalation: a cheap Chrome-impersonated request first, a real stealth browser only when the site demands it, and a fresh IP when the current one is burned.
When a browser earns a clearance cookie, cheaper requests can reuse it — and a request should only count as success when it returns the real page, not a challenge.
None of this is 100%. It's probabilistic, the hardest sites need residential or mobile IPs, and it applies to public data only.

The four layers of bot detection

Anti-bot vendors don't rely on a single check. They score every request across several independent layers, and a request has to look human on all of them. Understanding the layers is the whole game, because each one rules out a different class of naive scraper.

Layer	What it checks	Why a normal scraper fails
IP reputation	Is the address residential/mobile (trusted) or datacenter (suspect)? Has it made too many requests?	Scrapers run on cloud servers and datacenter proxies — ranges these vendors flag on sight
TLS / HTTP fingerprint	Do the TLS handshake and HTTP/2 frame order match a real browser (JA3/JA4, Akamai fingerprint)?	`requests`, `curl`, and most HTTP libraries have a fingerprint nothing like Chrome's, no matter the headers
JavaScript sensor	A script probes for `navigator.webdriver`, headless tells, missing APIs, canvas/WebGL/font quirks	Headless automation leaks dozens of signals; an HTTP client runs no JavaScript at all, so the sensor never reports back
Behaviour	Request cadence, mouse/scroll, navigation patterns, cookie continuity	Scripts hit pages faster and more mechanically than a person, from a session with no history

The reason "just set a realistic User-Agent" stopped working years ago is that the User-Agent is a single string in the least important layer. You can claim to be Chrome 140 all you like; if your TLS handshake says Python and your IP says AWS, you've already failed two checks before the page even loads.
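The "all layers, not one" rule can be sketched as a toy model. The field names and the simple pass/fail booleans below are illustrative assumptions, not any vendor's actual scoring; real systems weigh many signals probabilistically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSignals:
    """Hypothetical summary of what each detection layer sees for one request."""
    ip_is_datacenter: bool      # IP reputation layer
    tls_matches_browser: bool   # TLS / HTTP fingerprint layer
    js_sensor_reported: bool    # JavaScript sensor layer
    cadence_looks_human: bool   # behaviour layer


def failed_layers(s: RequestSignals) -> list[str]:
    """Return the names of the layers this request fails, in table order."""
    checks = [
        ("ip_reputation", not s.ip_is_datacenter),
        ("tls_http_fingerprint", s.tls_matches_browser),
        ("javascript_sensor", s.js_sensor_reported),
        ("behaviour", s.cadence_looks_human),
    ]
    return [name for name, passed in checks if not passed]


# A plain HTTP library on a cloud server with a spoofed User-Agent.
# The User-Agent does not appear here at all: it changes none of the signals.
naive = RequestSignals(
    ip_is_datacenter=True,
    tls_matches_browser=False,
    js_sensor_reported=False,
    cadence_looks_human=False,
)
print(failed_layers(naive))
```

A request is only let through in this model when `failed_layers` returns an empty list, which is why fixing one layer at a time rarely changes the outcome.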

What the vendors actually do

The big three behave differently enough that it's worth knowing which wall you're looking at.

Cloudflare issues a managed challenge and, on success, a cf_clearance cookie that's bound to your IP and User-Agent. Pass the challenge and the cookie buys you a window of access — change IP and it's void.
DataDome scores the request in real time and sets a datadome cookie; it's aggressive about datacenter IPs and replays of a fingerprint across too many requests.
PerimeterX / HUMAN runs a JavaScript sensor that POSTs a signal payload and grants a _px3-family clearance cookie. It's strongly IP-bound: a cookie minted on one IP, presented from another, reads as theft and gets blocked harder than no cookie at all.

The common thread: a clearance — the cookie that says "this client passed" — is only earned by executing the challenge in a real browser, and it's tied to the IP that earned it. That single fact dictates the entire strategy.

Which wall, which fix

Each vendor leaves a fingerprint of its own — a tell-tale cookie or block message — and yields to a different lever. This is the rough field guide:

Wall	How you know it's there	What gets through	Difficulty
Cloudflare	`cf_clearance` / `__cf_bm` cookies, a "Checking your browser…" interstitial	A real browser that solves the managed challenge; then reuse `cf_clearance`	Low–medium
DataDome	`datadome` cookie, a 403 with a DataDome CAPTCHA page	Trusted IP + real fingerprint; rotate IPs — it punishes replay	Medium
PerimeterX / HUMAN	`_px*` cookies, "Access to this page has been denied"	A browser that runs the JS sensor; clearance is IP-bound, so race fresh IPs	Medium–high
Akamai Bot Manager	`_abck` / `bm_sz` / `ak_bmsc` cookies	Real TLS fingerprint + browser; the `_abck` cookie must validate or you're shadow-blocked	High
Kasada	`x-kpsdk-*` headers, a `kpsdk` sensor script	Full browser execution of the sensor; among the hardest to pass headlessly	High

The pattern repeats: every one of them ultimately wants to see a real browser, from an IP it trusts, behaving like a person. The walls differ mostly in how strict each of those three checks is.
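The tell-tale cookies, headers, and block messages in the table above can be turned into a rough identifier. This is a minimal sketch: `detect_wall` is a hypothetical helper, the check order is an assumption (more specific vendors first, since a site can sit on Cloudflare's CDN and still use another vendor's bot protection), and a match only says the vendor is present, not that the request was blocked.

```python
from typing import Mapping, Optional


def detect_wall(cookies: Mapping[str, str],
                headers: Mapping[str, str],
                body: str) -> Optional[str]:
    """Guess which anti-bot vendor produced a response, using the table's tells."""
    cookie_names = {name.lower() for name in cookies}
    header_names = {name.lower() for name in headers}
    text = body.lower()

    if any(n.startswith("x-kpsdk-") for n in header_names) or "kpsdk" in text:
        return "kasada"
    if cookie_names & {"_abck", "bm_sz", "ak_bmsc"}:
        return "akamai"
    if (any(n.startswith("_px") for n in cookie_names)
            or "access to this page has been denied" in text):
        return "perimeterx"
    if "datadome" in cookie_names:
        return "datadome"
    if cookie_names & {"cf_clearance", "__cf_bm"} or "checking your browser" in text:
        return "cloudflare"
    return None


print(detect_wall({"_px3": "abc"}, {}, ""))
print(detect_wall({"__cf_bm": "1", "datadome": "2"}, {}, ""))
```

Knowing the vendor up front lets a scraper skip rungs that are unlikely to work, for example going straight to a browser when the Kasada sensor script is present.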

How to get through reliably

The mistake most scrapers make is doing one thing — a stealth browser, or a residential proxy — for every request. That's slow and expensive on easy pages and still fragile on hard ones. The reliable pattern is escalation: start cheap, and climb only as far as a specific site forces you to.

Chrome-impersonated HTTP. A plain request, but with a TLS and HTTP/2 fingerprint that matches real Chrome. This alone clears the fingerprint layer and is enough for a surprising number of "protected" sites — at a fraction of the cost of a browser.
A real stealth browser. When the page needs JavaScript executed — a Cloudflare or PerimeterX challenge — hand it to a fleet of hardened browser engines with patched automation tells and genuine fingerprints. Different engines beat different vendors, so racing or rotating across a fleet matters more than betting on one.
A fresh IP. Anti-bot is IP-reputation-first, so when the current exit is flagged, the highest-yield move is simply a different address. Firing several requests through a rotating pool at once, and taking the first that comes back with the real page, turns a coin-flip into far better odds, because one fresh IP usually passes while others are blocked.
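The three rungs above can be sketched as a small ladder that stops at the first real page. Everything here is hypothetical: the fetchers are stubs standing in for a Chrome-impersonating HTTP client, a stealth browser, and a browser on a fresh exit IP, and each is assumed to return `None` when it only received a challenge or block page.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a URL and returns the page body, or None when it was blocked.
Fetcher = Callable[[str], Optional[str]]


def escalate(url: str,
             ladder: Sequence[Tuple[str, Fetcher]]) -> Tuple[Optional[str], Optional[str]]:
    """Try each rung in order, cheapest first, and stop at the first real page."""
    for rung, fetch in ladder:
        page = fetch(url)
        if page is not None:
            return rung, page
    return None, None


def impersonated_http(url: str) -> Optional[str]:
    return None  # stub: pretend the site answered with a JavaScript challenge


def stealth_browser(url: str) -> Optional[str]:
    return "<html>real page</html>"  # stub: pretend the browser solved it


def stealth_browser_fresh_ip(url: str) -> Optional[str]:
    return "<html>real page</html>"  # stub: never reached in this run


ladder = [
    ("impersonated_http", impersonated_http),
    ("stealth_browser", stealth_browser),
    ("stealth_browser_fresh_ip", stealth_browser_fresh_ip),
]
rung, page = escalate("https://example.com/product/123", ladder)
print(rung)  # stealth_browser
```

Ordering the ladder by cost is the design choice that matters: the expensive rungs only run for pages that defeated the cheap ones.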

Not all addresses are equal, though, and this is where cost enters. IPs come in tiers of reputation, and the more trusted the tier, the more it costs:

Datacenter IPs are cheap and plentiful, but the most flagged — whole ranges are known to belong to clouds and hosting providers, so vendors distrust them by default.
Residential IPs are real home-broadband addresses sourced through proxy networks. They look like ordinary visitors, carry far more trust, and cost meaningfully more.
Mobile IPs are carrier-NAT'd 4G/5G addresses — the most trusted of all, because thousands of real phones share one address, so blocking it risks blocking real customers. They're also the most expensive, usually billed by the gigabyte of traffic rather than per address.

The reflex on a hard wall is to reach straight for the priciest IPs. The cheaper play is the one this whole section is about: lean on the stack — escalate to a real browser, race several datacenter IPs at once, and reuse a clearance once you earn it — so you get through on low-cost addresses as often as possible, and only fall back to residential or mobile for the targets that genuinely demand them. You pay for reputation exactly when the page forces you to, and not a request sooner.
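Racing several exits at once might look like the following, using only the standard library. `race` and `fetch_via` are hypothetical names, the exit addresses are documentation-range placeholders, and the stub fetcher stands in for a real proxied request that returns `None` when it is blocked.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional, Sequence


def race(url: str,
         exits: Sequence[str],
         fetch_via: Callable[[str, str], Optional[str]]) -> Optional[str]:
    """Fetch url through every exit concurrently; return the first real page."""
    if not exits:
        return None
    pool = ThreadPoolExecutor(max_workers=len(exits))
    try:
        futures = [pool.submit(fetch_via, url, exit_ip) for exit_ip in exits]
        for future in as_completed(futures):
            try:
                page = future.result()
            except Exception:
                continue  # an attempt that errored simply loses the race
            if page is not None:
                return page
        return None
    finally:
        # Do not wait for the losers. Attempts already in flight keep running
        # in the background until their own timeout; queued ones are cancelled.
        pool.shutdown(wait=False, cancel_futures=True)


def stub_fetch_via(url: str, exit_ip: str) -> Optional[str]:
    # Stub: only one of the placeholder exits is "unflagged".
    return "<html>real page</html>" if exit_ip == "203.0.113.7" else None


exits = ["203.0.113.5", "203.0.113.6", "203.0.113.7", "203.0.113.8"]
print(race("https://example.com/product/123", exits, stub_fetch_via))
```

The trade-off is that every raced attempt consumes proxy bandwidth whether or not it wins, so the pool size is a cost dial as much as a reliability one.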

Two refinements make this fast as well as reliable:

Reuse the clearance. When a browser earns a cf_clearance or DataDome cookie, cache it per-domain for its short lifetime and attach it to later requests from the same IP. A cheap engine can then ride a clearance an expensive one paid for — higher success, lower cost.
Only count a real page as success. A challenge page or a "checking your browser" shell can come back as a 200 with bytes in the body, and even a 403 carries an HTML body that a careless parser will happily accept. A scraper that treats those as success hands you garbage and learns nothing. Detecting the difference, and escalating instead of returning the shell, is what separates a number that looks good from data you can use.
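The first refinement, clearance reuse, can be sketched as a small cache. Because the clearance is tied to the IP that earned it, this sketch keys entries by domain and exit IP together. `ClearanceCache` is a hypothetical class, and the lifetime passed in is an assumption: the real validity window is set by the vendor and the site.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ClearanceCache:
    """Short-lived store of clearance cookies, keyed by (domain, exit IP)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[Tuple[str, str], Tuple[float, Dict[str, str]]] = {}

    def put(self, domain: str, exit_ip: str, cookies: Dict[str, str]) -> None:
        """Remember cookies a browser earned on this domain from this IP."""
        self._store[(domain, exit_ip)] = (self._clock() + self._ttl, dict(cookies))

    def get(self, domain: str, exit_ip: str) -> Optional[Dict[str, str]]:
        """Return unexpired cookies for this exact (domain, IP) pair, else None."""
        key = (domain, exit_ip)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, cookies = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it so a browser re-earns it
            return None
        return dict(cookies)


cache = ClearanceCache(ttl_seconds=600)  # assumed lifetime, for illustration
cache.put("example.com", "203.0.113.7", {"cf_clearance": "token"})
print(cache.get("example.com", "203.0.113.7"))  # same IP: cookie is reused
print(cache.get("example.com", "203.0.113.8"))  # different IP: None
```

Returning `None` for a different IP is deliberate: presenting a clearance from the wrong address is worse than presenting none.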

A worked example: a PerimeterX-protected page

Here's how the pieces compose on a real, stubborn target — a news site behind PerimeterX, where a datacenter exit IP had been used enough that the wall had already flagged it.

One engine, one flagged IP: asking a single stealth browser to fetch the page from that burned IP succeeded only about 1 in 6 times. The engine was capable; the IP was the bottleneck.
Race fresh IPs: firing four requests for the same page concurrently, each through a different exit IP, and taking the first that returned the real article, lifted success to 5 in 6 — and it was faster, because the winner usually came back in a few seconds while the blocked attempts were abandoned.
Reuse the clearance: once one browser earned PerimeterX's clearance cookie, even a lightweight engine that never beats the wall on its own rode that cached cookie straight to the full page. The expensive request paid for the clearance; the cheap ones cashed it in.
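A back-of-envelope check on those numbers, under a simplifying assumption the measurements do not establish: that each raced attempt passes independently with the same probability. With per-attempt rate p and n attempts, the race succeeds with probability 1 - (1 - p)^n.

```python
def race_success(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n


def implied_per_attempt_rate(observed: float, n: int) -> float:
    """Invert race_success: the per-attempt rate that yields `observed` over n."""
    return 1.0 - (1.0 - observed) ** (1.0 / n)


# If fresh IPs were no better than the burned one (1 in 6 each),
# racing four of them would already help:
print(round(race_success(1 / 6, 4), 3))              # 0.518

# The observed 5 in 6 over four attempts implies each fresh IP
# passes noticeably more often than the burned one did:
print(round(implied_per_attempt_rate(5 / 6, 4), 3))  # 0.361
```

Read this way, the gain comes from two places: more attempts per page, and each attempt starting from a cleaner IP. The sample sizes quoted are small, so these figures are indicative only.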

No single one of these is a silver bullet. The result comes from stacking them — escalate to a browser, race fresh IPs, reuse what works — and from refusing to count a challenge page as a win. That last point is easy to get wrong: the naive version returns a 200 full of nothing and reports a great success rate.
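A minimal success check along those lines might look like this. The marker strings are the block messages quoted earlier in this guide; `is_real_page` and the `min_chars` threshold are assumptions for illustration, and a production check would more likely look for content the page is known to contain.

```python
CHALLENGE_MARKERS = (
    "checking your browser",
    "access to this page has been denied",
)


def is_real_page(status: int, body: str, min_chars: int = 500) -> bool:
    """Heuristic: True only when the response looks like real content."""
    if status != 200:
        return False  # a 403 or 429 is never a success, whatever the body holds
    text = body.lower()
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return False  # a 200 that is really an interstitial
    return len(body.strip()) >= min_chars  # near-empty shells do not count


shell = "<html><title>Just a moment</title>Checking your browser...</html>"
article = "<html><article>" + "Real article text. " * 50 + "</article></html>"

print(is_real_page(200, shell))    # False
print(is_real_page(403, article))  # False
print(is_real_page(200, article))  # True
```

Substring markers misfire on pages that legitimately quote those phrases (an article about bot detection, for instance), which is one reason to prefer positive checks for expected content where possible.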

Build it yourself, or buy it?

You can assemble all of this yourself, and for a single target it's a reasonable weekend project: a stealth-patched browser, a proxy, a cookie cache. The cost shows up later, as a treadmill.

Building it means owning a fleet of patched browser engines (the patches go stale every Chrome release), a residential or mobile proxy budget (the line item that actually sets your ceiling), the escalation logic that picks the cheapest method per site, success detection that isn't fooled by challenge pages, and a standing commitment to re-fix all of it every time a vendor ships a detection update — which is their full-time job and, for you, a side-quest that keeps interrupting the real one.

Buying it — a managed scraping API — trades that treadmill for a per-request price. The honest comparison isn't "API credits versus free code"; it's "API credits versus a proxy bill plus the engineering weeks you'll spend maintaining detection bypasses instead of shipping your product." For one or two easy sites, DIY wins. For a moving list of protected targets you need to keep reliable, the maintenance is the product — and that's the part worth outsourcing.

Where Crawlora fits

This is exactly the model Crawlora's Web Scraping API is built on. A single /web/scrape call escalates on its own — Chrome-impersonated HTTP, then a fleet of stealth browser engines, then fresh IPs raced concurrently — captures and reuses clearance cookies per domain, and only returns a page once it's confirmed to be real content rather than a challenge. You get clean Markdown (or HTML, links, and metadata) back, and you're billed for what succeeds.

A request is one call — ask for the formats you want and let it escalate:

curl -X POST "https://api.crawlora.net/api/v1/web/scrape" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/123",
    "formats": ["markdown", "links", "metadata"],
    "render": "auto"
  }'

render: "auto" is the escalation switch: it starts with Chrome-impersonated HTTP and climbs to the stealth-browser fleet (and fresh IPs) only if the page demands it, so you don't pay browser prices for pages that don't need a browser. The same call from Python:

import requests

resp = requests.post(
    "https://api.crawlora.net/api/v1/web/scrape",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com/product/123",
        "formats": ["markdown", "links", "metadata"],
        "render": "auto",  # escalate HTTP -> stealth browser -> fresh IP as needed
    },
    timeout=60,
)
resp.raise_for_status()            # stop on a 4xx/5xx instead of parsing an error body
data = resp.json()["data"]
print(data["markdown"])            # clean article text, not raw HTML
print(data["metadata"]["title"])   # parsed page metadata

If you just want to know what you're up against before you write a line of code, run a URL through the free can-I-scrape-this-site checker: it reports which protection a site uses and how hard it'll be to collect.

A closing note on scope: everything here is for public pages. Bot detection is not a login, and getting past one isn't the same as breaking into private data — but the responsible line is still public data only, with each site's terms and robots directives respected, and personal or copyrighted content left alone. For the legal landscape, see is web scraping legal in 2026; for the related question of gated articles, see how paywalls actually work. Cloudflare's own AI-crawler policy is evolving on top of this same detection stack — see Cloudflare's new AI-crawler defaults for 2026 for what changed and what it means for legitimate scrapers.

Scrape the sites that block everyone else

One API call — auto-escalating stealth browsers, rotating IPs, and clearance reuse — that returns clean Markdown and bills only for what succeeds. 2,000 free credits a month, no card.

Frequently asked questions

How do websites detect and block scrapers?

Modern anti-bot systems check four layers at once: the IP's reputation (datacenter ranges are flagged, residential and mobile are trusted), the TLS and HTTP/2 fingerprint (a real Chrome handshake looks different from a Python or curl one), a JavaScript sensor that probes the browser for automation tells (headless flags, missing APIs, canvas/WebGL quirks), and behaviour over time. Failing any layer earns a CAPTCHA or a block page, so a scraper that only spoofs the User-Agent gets stopped immediately.

Why does my scraper get blocked by Cloudflare or DataDome?

Almost always the IP and the fingerprint. Requests from a cloud server (AWS, GCP, a datacenter proxy) sit in ranges these vendors treat as suspicious, and a non-browser HTTP client has a TLS/JS fingerprint that doesn't match a real Chrome. Cloudflare, DataDome and PerimeterX combine those signals — so the fix isn't a better User-Agent string, it's a real browser fingerprint coming from a trusted IP.

Can you scrape a site protected by Cloudflare, DataDome or PerimeterX?

Often, yes, for public pages — but it's probabilistic, not guaranteed. The reliable approach escalates only as far as a site demands: a Chrome-impersonated HTTP request first, then a real stealth browser that executes the challenge, then a fresh IP if the current one is flagged. Once a browser earns a clearance cookie, cheaper requests can reuse it. The hardest, IP-strict sites need residential or mobile egress to stay reliable.

Is it legal to scrape sites that block bots?

Scraping publicly accessible pages is broadly defensible, and a bot-detection wall is not itself an access control on private data the way a login is. But the law turns on what you collect and how you use it — respect each site's terms and robots directives, avoid personal data and copyrighted content at scale, and never use this to get past a login or a paywall. This is not legal advice.

