How we measured it.

The Anti-Bot Adoption Index is a passive, reproducible fingerprint of the full top 1,000,000 sites, plus a full-transport-fleet difficulty grade on 1005 curated sites. We publish the method in full so the numbers are checkable — and so you know exactly what a “protected” result does and doesn’t mean.

The sample

Tranco-seeded, categorised.

Both populations are seeded from the Tranco research list 6WNKX (2026-06-11) — an aggregated, citable top-sites ranking with a permanent ID, so the seed is reproducible rather than hand-picked.

The web-scale figures come from the full top 1,000,000; the difficulty and category deep-dive comes from 1005 sites walked from the top and bucketed into 28 categories (pure CDN/infrastructure, ad/tracking, auth-only and adult domains skipped). All 1005 were reachable in this run; unreachable sites are excluded from percentages.
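As a rough illustration of the "unreachable sites are excluded from percentages" rule, the sketch below computes a share over reachable sites only. The function name and the counts are hypothetical and are not figures from this study.

```python
def share_of_reachable(detected: int, sampled: int, unreachable: int) -> float:
    """Percentage of reachable sites with a detection.

    Unreachable sites are dropped from the denominator, not counted as unprotected.
    """
    reachable = sampled - unreachable
    if reachable <= 0:
        raise ValueError("no reachable sites to compute a share over")
    return 100.0 * detected / reachable


# Hypothetical counts: 50 unreachable sites leave the denominator.
print(share_of_reachable(detected=250, sampled=1050, unreachable=50))
```

Dropping unreachable sites from the denominator keeps a hard-blocked site from being read as an unprotected one.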

The probe

One passive request.

Each site’s homepage is fetched once with a real Chrome user-agent, following redirects, from a datacenter IP: the honest “what a basic cloud scraper sees” vantage. We capture the response headers, the Set-Cookie names, and a capped slice of the body.

We never run a CAPTCHA, submit a form, or log in. The top-1M scan is that single datacenter GET. For the curated set we go one step further: any site that blocks the datacenter GET is re-probed through the full proxied transport fleet (browser-impersonation → headless → stealth + residential), so its difficulty band reflects what actually reaches the page. This is the same engine as Crawlora’s anti-bot checker.

If the bare apex returns no response at all, we retry once on www.<domain>, because many sites serve HTTPS only on the www host. A site we still can’t reach is labelled “No probe response” rather than “offline”: from a datacenter vantage that usually means a hard IP block or a geo-restriction, and the site is often still live behind a residential/regional IP. Treat that bucket as a lower bound, not a list of dead sites.
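The apex-then-www retry can be sketched as below. This is a minimal illustration, not the study's actual probe: `probe_homepage` and the injected `fetch` callable are hypothetical names, and `fetch` is assumed to return `None` when no response arrives at all.

```python
from typing import Callable, Optional

# Assumed contract: fetch(url) returns a response dict, or None if nothing came back.
Fetch = Callable[[str], Optional[dict]]


def probe_homepage(domain: str, fetch: Fetch) -> dict:
    """Try the bare apex first, then retry exactly once on the www host."""
    for host in (domain, f"www.{domain}"):
        response = fetch(f"https://{host}/")
        if response is not None:
            # Any response counts here, including a block or challenge page.
            return {"host": host, "label": "responded", "response": response}
    # Still nothing: label it neutrally rather than calling the site offline.
    return {"host": domain, "label": "No probe response", "response": None}
```

Passing the fetcher in as a parameter keeps the retry logic separate from the HTTP client, so the same function can be exercised with a stub.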

The signatures

What names a vendor.

The vendor is identified from public, documented fingerprints — response header names (e.g. cf-ray, x-datadome), Set-Cookie names (_abck, datadome, _px), and, only on a challenge-shaped response, body markers. Header/cookie matches are high-confidence; body markers are medium.
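A minimal sketch of that matching order follows, using only the header and cookie names quoted above. Several details are assumptions made for illustration: the prefix match on cookie names, the status codes that stand in for a "challenge-shaped" response, and the placeholder body marker and vendor name, which are not real signatures.

```python
# Header and cookie names are the examples quoted in the text.
HEADER_SIGNATURES = {"cf-ray": "Cloudflare", "x-datadome": "DataDome"}
COOKIE_SIGNATURES = {
    "_abck": "Akamai Bot Manager",
    "datadome": "DataDome",
    "_px": "PerimeterX (HUMAN)",
}
# Hypothetical placeholder, not a real vendor marker.
BODY_MARKERS = {"example-challenge-marker": "ExampleVendor"}
# Assumption: these status codes approximate a "challenge-shaped" response.
CHALLENGE_STATUSES = {403, 429, 503}


def identify_vendors(headers, cookie_names, status=200, body=""):
    """Return {vendor: confidence}; header/cookie hits are high, body hits medium."""
    found = {}
    header_names = {name.lower() for name in headers}
    for signature, vendor in HEADER_SIGNATURES.items():
        if signature in header_names:
            found[vendor] = "high"
    for cookie in cookie_names:
        for signature, vendor in COOKIE_SIGNATURES.items():
            # Assumption: prefix match, so suffixed variants of a cookie name also count.
            if cookie.lower().startswith(signature):
                found[vendor] = "high"
    if status in CHALLENGE_STATUSES:
        lowered = body.lower()
        for marker, vendor in BODY_MARKERS.items():
            if marker in lowered:
                # Never downgrade a vendor already matched at high confidence.
                found.setdefault(vendor, "medium")
    return found
```

Body markers are consulted only on a challenge-shaped response, so an ordinary page that merely mentions a vendor does not count as a detection.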

CAPTCHA widgets are typed and version-classified from script sources and markup — reCAPTCHA v2/v3/Enterprise, hCaptcha, Cloudflare Turnstile, Arkose FunCaptcha, GeeTest v3/v4, AWS WAF and others. Most sites only show a CAPTCHA on login/checkout, so homepage CAPTCHA counts are a floor.
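As an illustration of typing a widget from its script source, the sketch below classifies three of the listed families from a script URL. It is a simplification and not the index's actual classifier: it relies on the publicly documented loader URLs, and it assumes that a `render` query parameter carrying a site key indicates reCAPTCHA v3.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def classify_captcha(script_src: str) -> Optional[str]:
    """Guess the CAPTCHA family and version from a <script src> URL."""
    parsed = urlparse(script_src)
    host, path = parsed.netloc.lower(), parsed.path
    if host == "challenges.cloudflare.com" and path.startswith("/turnstile/"):
        return "Cloudflare Turnstile"
    if host == "hcaptcha.com" or host.endswith(".hcaptcha.com"):
        return "hCaptcha"
    if "/recaptcha/" in path:
        if path.endswith("/enterprise.js"):
            return "reCAPTCHA Enterprise"
        render = parse_qs(parsed.query).get("render", [""])[0]
        # v2 loads with no render value, or with "explicit"/"onload";
        # v3 passes the site key in render.
        if render not in ("", "explicit", "onload"):
            return "reCAPTCHA v3"
        return "reCAPTCHA v2"
    return None
```

A production classifier would also inspect markup such as widget container attributes, since a script URL alone cannot separate every variant.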

Proprietary signed-payload VMs (TikTok’s webmssdk/X-Bogus, Kasada, F5/Shape) are flagged as a distinct “closed VM” class.

Vendors detected in this run

Cloudflare: 333
Akamai Bot Manager: 111
Akamai (edge): 53
DataDome: 30
Imperva (Incapsula): 15
PerimeterX (HUMAN): 15
Cloudflare Turnstile: 12
Google reCAPTCHA: 6

The difficulty score

The lightest transport that reached the page.

For the curated set, the band is the lightest transport that actually reached each page: a site that blocks a datacenter request is re-probed through the full proxied fleet (browser-impersonation → headless → stealth + residential). It's a point-in-time snapshot, not a guarantee — a grade can shift with IP reputation. The top-1M scan is HTTP-tier only and is not difficulty-graded.

Plain HTTP client (band: Easy)

No managed anti-bot detected — a plain HTTP request reaches it.

Browser-impersonation HTTP (band: Medium)

Wants a matched TLS/JA3-JA4 fingerprint, correct HTTP/2 frame order, and realistic headers. Akamai and open Cloudflare paths live here.

Headless browser (JS) (band: Hard)

Needs a real browser to execute the vendor's JavaScript challenge — Cloudflare managed challenge, Imperva, AWS WAF challenge, most CAPTCHA gates.

Stealth browser + residential IP + behavior (band: Very hard)
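The "lightest transport that reached the page" rule can be sketched as an ordered escalation. The transport identifiers and the `reached` callable below are hypothetical stand-ins for the real fleet; only the four band names come from the text.

```python
from typing import Callable, Optional

# Ordered lightest to heaviest; the identifiers are illustrative, the bands are from the text.
TRANSPORT_BANDS = [
    ("plain-http", "Easy"),
    ("browser-impersonation", "Medium"),
    ("headless-browser", "Hard"),
    ("stealth-residential", "Very hard"),
]


def difficulty_band(reached: Callable[[str], bool]) -> Optional[str]:
    """Return the band of the first (lightest) transport that reaches the page."""
    for transport, band in TRANSPORT_BANDS:
        if reached(transport):
            return band
    # Nothing got through; this sketch leaves such a site ungraded.
    return None


# Example: a site that only yields to a headless browser or heavier.
print(difficulty_band(lambda t: t in {"headless-browser", "stealth-residential"}))
```

Because the loop stops at the first success, a site is never graded harder than the lightest transport that worked in that run.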

The fine print

What this does not tell you.

This check can be inaccurate or out of date

Anti-bot is deliberately dynamic, so a snapshot like this can be wrong in both directions — and the vendors update their models constantly, often daily.

Homepages are the open front door. Login, checkout, search and deep listings are usually protected more heavily than the homepage we tested.
Difficulty is a moving snapshot, not a fixed property. What you see depends on the IP and its live reputation: the same page returns content from a clean residential IP but a challenge from a flagged datacenter one, and even a single IP flips over time, so a site can read “medium” one run and “hard” the next. This study ran from a datacenter; read each band as a sample, not a guarantee.
Challenges are conditional. Cloudflare managed challenge, DataDome and PerimeterX only trip on suspicious signals, so a protection can be present but invisible to a passive scan.
Vendors ship updates and per-customer configs constantly. Akamai added JA4 fingerprinting in 2026; a signature that’s correct today can be renamed or reconfigured tomorrow.
“Not detected” does not mean “easy.” It can mean a protection we didn’t recognise, a challenge that hadn’t triggered, or behavioural/TLS defenses that don’t show up in passive HTML. This bites hardest at the head of the web, where the largest sites are the most likely to run undetectable in-house systems, so the top-rank “protected” rates are a floor, not a ceiling.


Snapshot 2026-06-13. Licensed CC BY 4.0 — cite as “Crawlora Anti-Bot Adoption Index” with a link.