Crawlora

Structured public web data APIs for search, maps, geocoding, streaming, travel, real estate, marketplaces, apps, social, audio, crypto, finance, and AI workflows with managed execution and credit-based usage.

© 2026 Crawlora. All rights reserved. · Built by Tony Wang
By Tony Wang · June 11, 2026 · 11 min read

Scraping Sites That Block Bots: Cloudflare, DataDome & PerimeterX

Why scrapers get blocked by Cloudflare, DataDome and PerimeterX — and how to get through reliably with stealth browsers, IP rotation and clearance reuse.

Tags: Web Scraping API · Anti-Bot · Guide

For most of the web, scraping is a solved problem: fetch the URL, parse the HTML, done. The interesting sites — the ones with prices, listings, reviews, and inventory worth collecting — are exactly the ones that don't let you. They sit behind Cloudflare, DataDome, PerimeterX (now HUMAN), Akamai, or Kasada, and the moment a script asks for a page it gets a CAPTCHA, a "checking your browser" interstitial, or a flat 403. The hard part of modern scraping isn't parsing the page. It's getting the page.

This guide explains how that wall actually works — the signals these systems check and why a normal HTTP client trips every one of them — and then how a scraper gets through reliably without pretending the problem is simpler than it is.

Key takeaways

  • Bot detection is layered: IP reputation, the TLS/HTTP fingerprint, a JavaScript sensor that probes the browser, and behaviour over time. You have to pass all of them, not one.
  • A datacenter IP and a non-browser fingerprint are what get most scrapers blocked — not a missing User-Agent. Spoofing headers alone does almost nothing.
  • The reliable pattern is escalation: a cheap Chrome-impersonated request first, a real stealth browser only when the site demands it, and a fresh IP when the current one is burned.
  • When a browser earns a clearance cookie, cheaper requests can reuse it — and a request should only count as success when it returns the real page, not a challenge.
  • None of this is 100%. It's probabilistic, the hardest sites need residential or mobile IPs, and it applies to public data only.

The four layers of bot detection

Anti-bot vendors don't rely on a single check. They score every request across several independent layers, and a request has to look human on all of them. Understanding the layers is the whole game, because each one rules out a different class of naive scraper.

| Layer | What it checks | Why a normal scraper fails |
| --- | --- | --- |
| IP reputation | Is the address residential/mobile (trusted) or datacenter (suspect)? Has it made too many requests? | Scrapers run on cloud servers and datacenter proxies, ranges these vendors flag on sight |
| TLS / HTTP fingerprint | Do the TLS handshake and HTTP/2 frame order match a real browser (JA3/JA4, Akamai fingerprint)? | requests, curl, and most HTTP libraries have a fingerprint nothing like Chrome's, no matter the headers |
| JavaScript sensor | A script probes for navigator.webdriver, headless tells, missing APIs, canvas/WebGL/font quirks | Headless automation leaks dozens of signals; an HTTP client runs no JavaScript at all, so the sensor never reports back |
| Behaviour | Request cadence, mouse/scroll, navigation patterns, cookie continuity | Scripts hit pages faster and more mechanically than a person, from a session with no history |

The reason "just set a realistic User-Agent" stopped working years ago is that the User-Agent is a single string in the least important layer. You can claim to be Chrome 140 all you like; if your TLS handshake says Python and your IP says AWS, you've already failed two checks before the page even loads.

Why headless Chrome alone isn't enough

Running Puppeteer or Playwright with stock headless Chrome fixes the JavaScript layer better than an HTTP client — it actually executes the sensor — but headless mode leaks its own tells (navigator.webdriver, automation-controlled flags, subtle rendering differences), and it does nothing for the IP or TLS layers. That's why "use a real browser" and "use a good proxy" are both necessary and neither is sufficient on its own.

What the vendors actually do

The big three behave differently enough that it's worth knowing which wall you're looking at.

  • Cloudflare issues a managed challenge and, on success, a cf_clearance cookie that's bound to your IP and User-Agent. Pass the challenge and the cookie buys you a window of access — change IP and it's void.
  • DataDome scores the request in real time and sets a datadome cookie; it's aggressive about datacenter IPs and replays of a fingerprint across too many requests.
  • PerimeterX / HUMAN runs a JavaScript sensor that POSTs a signal payload and grants a _px3-family clearance cookie. It's strongly IP-bound: a cookie minted on one IP, presented from another, reads as theft and gets blocked harder than no cookie at all.

The common thread: a clearance — the cookie that says "this client passed" — is only earned by executing the challenge in a real browser, and it's tied to the IP that earned it. That single fact dictates the entire strategy.

Which wall, which fix

Each vendor leaves a fingerprint of its own — a tell-tale cookie or block message — and yields to a different lever. This is the rough field guide:

| Wall | How you know it's there | What gets through | Difficulty |
| --- | --- | --- | --- |
| Cloudflare | cf_clearance / __cf_bm cookies, a "Checking your browser…" interstitial | A real browser that solves the managed challenge; then reuse cf_clearance | Low–medium |
| DataDome | datadome cookie, a 403 with a DataDome CAPTCHA page | Trusted IP + real fingerprint; rotate IPs, because it punishes replay | Medium |
| PerimeterX / HUMAN | _px* cookies, "Access to this page has been denied" | A browser that runs the JS sensor; clearance is IP-bound, so race fresh IPs | Medium–high |
| Akamai Bot Manager | _abck / bm_sz / ak_bmsc cookies | Real TLS fingerprint + browser; the _abck cookie must validate or you're shadow-blocked | High |
| Kasada | x-kpsdk-* headers, a kpsdk sensor script | Full browser execution of the sensor; among the hardest to pass headlessly | High |

The pattern repeats: every one of them ultimately wants to see a real browser, from an IP it trusts, behaving like a person. The walls differ mostly in how strict each of those three checks is.
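
The field guide above can be turned into a rough classifier. The sketch below is a heuristic only: the function name `detect_wall` and the matching order are assumptions made for illustration, the cookie, header, and message markers are the ones listed in the table, and real deployments can rename or omit any of them.

```python
from typing import Mapping, Optional

# Cookie-name prefixes taken from the field guide; treat them as hints,
# not as a vendor-supplied detection API.
COOKIE_MARKERS = [
    ("Akamai Bot Manager", ("_abck", "bm_sz", "ak_bmsc")),
    ("PerimeterX / HUMAN", ("_px",)),
    ("DataDome", ("datadome",)),
    ("Cloudflare", ("cf_clearance", "__cf_bm")),
]


def detect_wall(cookies: Mapping[str, str],
                headers: Mapping[str, str],
                body: str = "") -> Optional[str]:
    """Guess which anti-bot vendor produced a response, or None if unknown."""
    text = body.lower()

    # Kasada announces itself through x-kpsdk-* headers or a kpsdk script.
    header_names = [name.lower() for name in headers]
    if any(name.startswith("x-kpsdk-") for name in header_names) or "kpsdk" in text:
        return "Kasada"

    # Most vendors set a recognisable cookie.
    cookie_names = [name.lower() for name in cookies]
    for vendor, prefixes in COOKIE_MARKERS:
        if any(c.startswith(p) for c in cookie_names for p in prefixes):
            return vendor

    # Fall back to the block-page wording.
    if "access to this page has been denied" in text:
        return "PerimeterX / HUMAN"
    if "checking your browser" in text:
        return "Cloudflare"
    return None


print(detect_wall({"cf_clearance": "abc"}, {}))
print(detect_wall({}, {"X-Kpsdk-Ct": "token"}))
```

Knowing the wall up front lets a scraper skip rungs that cannot work, for example going straight to a browser when a sensor script is present.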

How to get through reliably

The mistake most scrapers make is doing one thing — a stealth browser, or a residential proxy — for every request. That's slow and expensive on easy pages and still fragile on hard ones. The reliable pattern is escalation: start cheap, and climb only as far as a specific site forces you to.

  1. Chrome-impersonated HTTP. A plain request, but with a TLS and HTTP/2 fingerprint that matches real Chrome. This alone clears the fingerprint layer and is enough for a surprising number of "protected" sites — at a fraction of the cost of a browser.
  2. A real stealth browser. When the page needs JavaScript executed — a Cloudflare or PerimeterX challenge — hand it to a fleet of hardened browser engines with patched automation tells and genuine fingerprints. Different engines beat different vendors, so racing or rotating across a fleet matters more than betting on one.
  3. A fresh IP. Anti-bot is IP-reputation-first, so when the current exit is flagged, the highest-yield move is simply a different address. Firing several requests through a rotating pool at once — and taking the first that comes back with the real page — turns a coin-flip into a near-certainty, because one fresh IP usually passes while others are blocked.
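
The three steps above can be sketched as an ordered ladder. This is a minimal illustration, not any particular product's implementation: `fetch_with_escalation`, `Attempt`, and the three stand-in fetchers are hypothetical names, and the fetchers return canned values where real ones would wrap an impersonating HTTP client, a stealth browser, and a proxy pool.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Attempt:
    tier: str   # which rung produced the page
    html: str   # the page body


# A tier is a name plus a callable that returns HTML, or None when blocked.
Tier = tuple[str, Callable[[str], Optional[str]]]


def fetch_with_escalation(url: str, tiers: Sequence[Tier]) -> Optional[Attempt]:
    """Try each tier in order (cheapest first); stop at the first real page."""
    for name, fetch in tiers:
        html = fetch(url)
        if html is not None:
            return Attempt(tier=name, html=html)
    return None  # every rung was blocked


# Stand-in fetchers for illustration only.
def impersonated_http(url: str) -> Optional[str]:
    return None  # pretend the site demanded JavaScript execution


def stealth_browser(url: str) -> Optional[str]:
    return "<html><h1>Product 123</h1></html>"


def browser_on_fresh_ip(url: str) -> Optional[str]:
    return "<html><h1>Product 123</h1></html>"


LADDER = [
    ("impersonated_http", impersonated_http),
    ("stealth_browser", stealth_browser),
    ("fresh_ip", browser_on_fresh_ip),
]

result = fetch_with_escalation("https://example.com/product/123", LADDER)
print(result.tier if result else "blocked")  # prints "stealth_browser"
```

Keeping the ladder as data makes the per-site policy easy to change: an easy site gets a one-rung ladder, a hard one starts at the browser.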

Not all addresses are equal, though, and this is where cost enters. IPs come in tiers of reputation, and the more trusted the tier, the more it costs:

  • Datacenter IPs are cheap and plentiful, but the most flagged — whole ranges are known to belong to clouds and hosting providers, so vendors distrust them by default.
  • Residential IPs are real home-broadband addresses sourced through proxy networks. They look like ordinary visitors, carry far more trust, and cost meaningfully more.
  • Mobile IPs are carrier-NAT'd 4G/5G addresses — the most trusted of all, because thousands of real phones share one address, so blocking it risks blocking real customers. They're also the most expensive, usually billed by the gigabyte of traffic rather than per address.

The reflex on a hard wall is to reach straight for the priciest IPs. The cheaper play is the one this whole section is about: lean on the stack — escalate to a real browser, race several datacenter IPs at once, and reuse a clearance once you earn it — so you get through on low-cost addresses as often as possible, and only fall back to residential or mobile for the targets that genuinely demand them. You pay for reputation exactly when the page forces you to, and not a request sooner.

Two refinements make this fast as well as reliable:

  • Reuse the clearance. When a browser earns a cf_clearance or DataDome cookie, cache it per-domain for its short lifetime and attach it to later requests from the same IP. A cheap engine can then ride a clearance an expensive one paid for — higher success, lower cost.
  • Only count a real page as success. A challenge page or a "checking your browser" shell often arrives as a 200 with bytes in the body, and even a 403 carries a body that parses like a page. A scraper that treats any of those as success hands you garbage and learns nothing. Detecting the difference, and escalating instead of returning the shell, is what separates a number that looks good from data you can use.
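
Clearance reuse can be sketched as a small cache. The class name `ClearanceCache`, the 600-second default lifetime, and the injectable clock are assumptions for illustration; real cookie lifetimes vary by vendor and site. The key includes the exit IP because, as described earlier, a clearance is tied to the address that earned it. A fuller key would likely also include the User-Agent, since Cloudflare binds cf_clearance to both.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ClearanceCache:
    """Store clearance cookies per (domain, exit IP) with a short lifetime."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds   # assumed lifetime; tune per vendor
        self._clock = clock
        self._store: Dict[Tuple[str, str], Tuple[float, Dict[str, str]]] = {}

    def put(self, domain: str, exit_ip: str, cookies: Dict[str, str]) -> None:
        """Remember the cookies a browser earned on this domain and IP."""
        self._store[(domain, exit_ip)] = (self._clock() + self._ttl, dict(cookies))

    def get(self, domain: str, exit_ip: str) -> Optional[Dict[str, str]]:
        """Return cached cookies, or None if missing, expired, or a different IP."""
        entry = self._store.get((domain, exit_ip))
        if entry is None:
            return None
        expires_at, cookies = entry
        if self._clock() >= expires_at:
            del self._store[(domain, exit_ip)]  # expired: force a fresh challenge
            return None
        return dict(cookies)


cache = ClearanceCache()
cache.put("example.com", "203.0.113.7", {"cf_clearance": "abc"})
print(cache.get("example.com", "203.0.113.7"))   # the cached cookies
print(cache.get("example.com", "198.51.100.9"))  # None: different exit IP
```

A cheap HTTP request would call `get` before fetching and attach the cookies only when the lookup hits for the exit IP it is about to use.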

The honest part: it's probabilistic, and IPs are the ceiling

No one beats every wall every time, and anyone claiming 100% is selling something. The technique raises the floor; the ceiling is your egress. Datacenter IPs — even rotated — get you through soft and medium protection. The hardest, IP-strict targets (some PerimeterX and DataDome deployments) only stay reliable from residential or mobile addresses. Honest success-based pricing matters here too: you should pay for the request that actually returned the page, not for every blocked attempt along the way.
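
To put rough numbers on "probabilistic": if each attempt passes with probability p, and attempts are independent, then racing n attempts succeeds with probability 1 - (1 - p)^n. Independence is a strong assumption, since exits drawn from the same flagged range tend to fail together, so treat the results as an upper bound. The function name `race_success` and the sample values of p are illustrative, not measurements.

```python
def race_success(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# A coin-flip per attempt improves quickly with more racers...
for n in (1, 2, 4, 8):
    print(f"p=0.5 n={n}: {race_success(0.5, n):.3f}")

# ...but a badly burned pool (p=0.1) stays unreliable even with four racers.
print(f"p=0.1 n=4: {race_success(0.1, 4):.3f}")
```

With p = 0.5, four racers reach about 0.94. With p = 0.1 the same four reach only about 0.34. That gap is the ceiling in practice: racing multiplies whatever trust the IP pool already has, and it cannot manufacture trust the pool lacks.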

A worked example: a PerimeterX-protected page

Here's how the pieces compose on a real, stubborn target — a news site behind PerimeterX, where a datacenter exit IP had been used enough that the wall had already flagged it.

  • One engine, one flagged IP: asking a single stealth browser to fetch the page from that burned IP succeeded only about 1 in 6 times. The engine was capable; the IP was the bottleneck.
  • Race fresh IPs: firing four requests for the same page concurrently, each through a different exit IP, and taking the first that returned the real article, lifted success to 5 in 6 — and it was faster, because the winner usually came back in a few seconds while the blocked attempts were abandoned.
  • Reuse the clearance: once one browser earned PerimeterX's clearance cookie, even a lightweight engine that never beats the wall on its own rode that cached cookie straight to the full page. The expensive request paid for the clearance; the cheap ones cashed it in.
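
The racing step can be sketched with asyncio. This is an illustration under stated assumptions: `race_exits` and `fake_fetch` are hypothetical names, the exit IPs are documentation addresses, and the fetcher simulates delays and outcomes instead of driving a real browser. A real fetcher would return None whenever it received a challenge page.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

Fetcher = Callable[[str, str], Awaitable[Optional[str]]]


async def race_exits(url: str, exit_ips: Sequence[str],
                     fetch: Fetcher) -> Optional[str]:
    """Fetch url through every exit IP at once; return the first real page."""
    tasks = [asyncio.create_task(fetch(url, ip)) for ip in exit_ips]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                page = await finished
            except Exception:
                continue  # a failed attempt is just a lost racer
            if page is not None:
                return page
        return None  # every exit was blocked
    finally:
        for task in tasks:
            task.cancel()  # abandon the losers
        await asyncio.gather(*tasks, return_exceptions=True)


async def fake_fetch(url: str, exit_ip: str) -> Optional[str]:
    # Simulated outcomes: (seconds to respond, page or None when blocked).
    delay, page = {
        "203.0.113.10": (0.05, None),
        "203.0.113.11": (0.01, None),
        "203.0.113.12": (0.02, "<html>real article</html>"),
        "203.0.113.13": (0.50, "<html>real article</html>"),
    }[exit_ip]
    await asyncio.sleep(delay)
    return page


ips = ["203.0.113.10", "203.0.113.11", "203.0.113.12", "203.0.113.13"]
page = asyncio.run(race_exits("https://example.com/news/1", ips, fake_fetch))
print(page)  # prints "<html>real article</html>"
```

The winner here is the third exit, which responds after the fastest blocked attempt but long before the slow success. Cancelling the remaining tasks is what keeps the race from costing four full fetches.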

No single one of these is a silver bullet. The result comes from stacking them — escalate to a browser, race fresh IPs, reuse what works — and from refusing to count a challenge page as a win. That last point is easy to get wrong: the naive version returns a 200 full of nothing and reports a great success rate.
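
A minimal sketch of that success check follows. The marker phrases are the block messages quoted earlier in this guide; the function name `is_real_page` and the 500-character minimum are assumptions for illustration, and a production check would need per-site tuning (a page that legitimately quotes one of these phrases would be misclassified).

```python
CHALLENGE_MARKERS = (
    "checking your browser",
    "access to this page has been denied",
)


def is_real_page(status: int, body: str, min_length: int = 500) -> bool:
    """Return True only when a response looks like real content."""
    if status != 200:
        return False
    text = body.lower()
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return False
    return len(body) >= min_length  # tiny bodies are usually empty shells


article = "<html>" + "real article text " * 50 + "</html>"
responses = [
    (200, article),
    (200, "<html>Checking your browser before accessing the site</html>"),
    (403, "<html>Access to this page has been denied</html>"),
]

naive = sum(1 for status, _ in responses if status == 200)
honest = sum(1 for status, body in responses if is_real_page(status, body))
print(f"naive: {naive}/3, honest: {honest}/3")  # naive: 2/3, honest: 1/3
```

The naive counter reports two successes out of three; only one response is usable. Feeding the honest result back into the escalation logic is what triggers the next rung instead of returning the shell.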

Build it yourself, or buy it?

You can assemble all of this yourself, and for a single target it's a reasonable weekend project: a stealth-patched browser, a proxy, a cookie cache. The cost shows up later, as a treadmill.

Building it means owning a fleet of patched browser engines (the patches go stale every Chrome release), a residential or mobile proxy budget (the line item that actually sets your ceiling), the escalation logic that picks the cheapest method per site, success detection that isn't fooled by challenge pages, and a standing commitment to re-fix all of it every time a vendor ships a detection update — which is their full-time job and, for you, a side-quest that keeps interrupting the real one.

Buying it — a managed scraping API — trades that treadmill for a per-request price. The honest comparison isn't "API credits versus free code"; it's "API credits versus a proxy bill plus the engineering weeks you'll spend maintaining detection bypasses instead of shipping your product." For one or two easy sites, DIY wins. For a moving list of protected targets you need to keep reliable, the maintenance is the product — and that's the part worth outsourcing.

Where Crawlora fits

This is exactly the model Crawlora's Web Scraping API is built on. A single /web/scrape call escalates on its own — Chrome-impersonated HTTP, then a fleet of stealth browser engines, then fresh IPs raced concurrently — captures and reuses clearance cookies per domain, and only returns a page once it's confirmed to be real content rather than a challenge. You get clean Markdown (or HTML, links, and metadata) back, and you're billed for what succeeds.

A request is one call — ask for the formats you want and let it escalate:

curl -X POST "https://api.crawlora.net/api/v1/web/scrape" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/123",
    "formats": ["markdown", "links", "metadata"],
    "render": "auto"
  }'

render: "auto" is the escalation switch: it starts with Chrome-impersonated HTTP and climbs to the stealth-browser fleet (and fresh IPs) only if the page demands it, so you don't pay browser prices for pages that don't need a browser. The same call from Python:

import requests

resp = requests.post(
    "https://api.crawlora.net/api/v1/web/scrape",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com/product/123",
        "formats": ["markdown", "links", "metadata"],
        "render": "auto",  # escalate HTTP -> stealth browser -> fresh IP as needed
    },
    timeout=60,
)
resp.raise_for_status()            # fail loudly on HTTP errors instead of parsing an error body
data = resp.json()["data"]
print(data["markdown"])            # clean article text, not raw HTML
print(data["metadata"]["title"])   # parsed page metadata

If you just want to know what you're up against before you write a line of code, run a URL through the free can-I-scrape-this-site checker: it reports which protection a site uses and how hard it'll be to collect.

A closing note on scope: everything here is for public pages. Bot detection is not a login, and getting past one isn't the same as breaking into private data — but the responsible line is still public data only, with each site's terms and robots directives respected, and personal or copyrighted content left alone. For the legal landscape, see is web scraping legal in 2026; for the related question of gated articles, see how paywalls actually work.

Sources

  • Cloudflare — Bot management & managed challenges
  • DataDome — How bot protection works
  • HUMAN (PerimeterX) — Bot defender
  • Wikipedia — TLS fingerprinting (JA3/JA4)
  • Crawlora — Can I scrape this site? (free checker)
  • Crawlora — Web Scraping API
  • Crawlora — Is web scraping legal in 2026?

Scrape the sites that block everyone else

One API call — auto-escalating stealth browsers, rotating IPs, and clearance reuse — that returns clean Markdown and bills only for what succeeds. 2,000 free credits a month, no card.


Frequently asked questions

How do websites detect and block scrapers?

Modern anti-bot systems check four layers at once: the IP's reputation (datacenter ranges are flagged, residential and mobile are trusted), the TLS and HTTP/2 fingerprint (a real Chrome handshake looks different from a Python or curl one), a JavaScript sensor that probes the browser for automation tells (headless flags, missing APIs, canvas/WebGL quirks), and behaviour over time. Failing any layer earns a CAPTCHA or a block page, so a scraper that only spoofs the User-Agent gets stopped immediately.

Why does my scraper get blocked by Cloudflare or DataDome?

Almost always the IP and the fingerprint. Requests from a cloud server (AWS, GCP, a datacenter proxy) sit in ranges these vendors treat as suspicious, and a non-browser HTTP client has a TLS and HTTP/2 fingerprint that doesn't match real Chrome and never runs the JavaScript sensor. Cloudflare, DataDome and PerimeterX combine those signals, so the fix isn't a better User-Agent string; it's a real browser fingerprint coming from a trusted IP.

Can you scrape a site protected by Cloudflare, DataDome or PerimeterX?

Often, yes, for public pages — but it's probabilistic, not guaranteed. The reliable approach escalates only as far as a site demands: a Chrome-impersonated HTTP request first, then a real stealth browser that executes the challenge, then a fresh IP if the current one is flagged. Once a browser earns a clearance cookie, cheaper requests can reuse it. The hardest, IP-strict sites need residential or mobile egress to stay reliable.

Is it legal to scrape sites that block bots?

Scraping publicly accessible pages is broadly defensible, and a bot-detection wall is not itself an access control on private data the way a login is. But the law turns on what you collect and how you use it — respect each site's terms and robots directives, avoid personal data and copyrighted content at scale, and never use this to get past a login or a paywall. This is not legal advice.

About the author

Tony Wang · Founder, Crawlora

Tony Wang is the founder of Crawlora and a senior software engineer with 9+ years across backend, cloud infrastructure, and large-scale web crawling — including distributed scrapers that have collected millions of profiles. He writes about web scraping, SERP and MCP APIs, and AI-agent data workflows.
