The Anti-Bot Adoption Index is a passive, reproducible fingerprint of 1005 top sites. We publish the method in full so the numbers are checkable, and so you know exactly what a “protected” result does and doesn’t mean.
The 1005 domains are seeded from the Tranco research list 6WNKX (2026-06-11) — an aggregated, citable top-sites ranking with a permanent ID, so the seed is reproducible rather than hand-picked.
We walk the ranking from the top and keep real, public, content-bearing sites, bucketed into 28 categories. Pure CDN/infrastructure, ad/tracking, auth-only and adult domains are skipped. On the run date, 960 of 1005 were reachable; the rest are excluded from percentages.
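As a small worked example of the denominator rule above, percentages are taken over the 960 reachable sites rather than all 1005 seeded ones. The function name and the sample counts in the usage note are hypothetical; only the two totals come from this page.

```python
TOTAL_SEEDED = 1005  # domains in the seed (from the text)
REACHABLE = 960      # domains that answered on the run date (from the text)

# Sites that did not respond are dropped from every percentage.
UNREACHABLE = TOTAL_SEEDED - REACHABLE


def share_of_reachable(count: int, reachable: int = REACHABLE) -> float:
    """Return count as a percentage of reachable sites, to one decimal place."""
    if not 0 <= count <= reachable:
        raise ValueError("count must be between 0 and the reachable total")
    return round(100 * count / reachable, 1)
```

For instance, a vendor seen on a hypothetical 480 sites would be reported as share_of_reachable(480), i.e. half of the reachable set, not 480 out of 1005.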
Each site’s homepage is fetched once with a real Chrome user-agent, following redirects, from a datacenter IP: the honest “what a basic cloud scraper sees” vantage. We capture the response headers, the Set-Cookie names, and a capped slice of the body.
We never run a CAPTCHA, submit a form, or attempt to access anything beyond the public homepage. Each check is a single GET of that page, using the same signal set as Crawlora’s anti-bot checker.
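The single passive fetch described above might look roughly like the following, using only the Python standard library. This is a sketch, not the index's actual crawler: the user-agent string, the 64 KiB body cap, the timeout and the function names are all assumptions, and the datacenter vantage depends on where the code runs rather than on anything in it.

```python
import urllib.error
import urllib.request

# Assumed values: the exact user-agent string and body cap are not published here.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
)
BODY_CAP = 64 * 1024


def cookie_names(set_cookie_values):
    """Keep only the cookie names from raw Set-Cookie header values."""
    names = []
    for raw in set_cookie_values:
        name = raw.split("=", 1)[0].strip()
        if name:
            names.append(name)
    return names


def fetch_homepage(domain, timeout=15.0):
    """Issue one GET for the homepage and record headers, cookie names and a body slice."""
    request = urllib.request.Request(
        f"https://{domain}/", headers={"User-Agent": CHROME_UA}, method="GET"
    )
    try:
        # urlopen follows redirects by default.
        response = urllib.request.urlopen(request, timeout=timeout)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry the headers and body we want to inspect.
        response = err
    with response:
        return {
            "status": response.code,
            "final_url": response.geturl(),
            "headers": {k.lower(): v for k, v in response.headers.items()},
            "cookie_names": cookie_names(response.headers.get_all("Set-Cookie") or []),
            "body": response.read(BODY_CAP),
        }
```

Nothing here submits a form or retries through a challenge; a blocked response is simply recorded as it arrives.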
The vendor is identified from public, documented fingerprints — response header names (e.g. cf-ray, x-datadome), Set-Cookie names (_abck, datadome, _px), and, only on a challenge-shaped response, body markers. Header/cookie matches are high-confidence; body markers are medium.
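The matching rule above can be sketched as a pair of lookup tables plus a gated body check. The header and cookie names are the ones quoted in the text; the vendor labels, the prefix matching on cookie names, the single body marker and the function name are assumptions made for illustration.

```python
# Header and cookie names quoted in the text, mapped to assumed vendor labels.
HEADER_FINGERPRINTS = {"cf-ray": "Cloudflare", "x-datadome": "DataDome"}
COOKIE_FINGERPRINTS = {
    "_abck": "Akamai",
    "datadome": "DataDome",
    "_px": "PerimeterX/HUMAN",
}
# Illustrative body marker only; a real list would be longer.
BODY_MARKERS = {"/cdn-cgi/challenge-platform/": "Cloudflare"}


def identify_vendors(headers, cookie_names, body="", challenge_shaped=False):
    """Return {vendor: confidence} from header names, cookie names and body markers."""
    found = {}
    header_names = {name.lower() for name in headers}
    for header, vendor in HEADER_FINGERPRINTS.items():
        if header in header_names:
            found[vendor] = "high"
    for cookie in cookie_names:
        for prefix, vendor in COOKIE_FINGERPRINTS.items():
            # Assumption: prefix match, so that _px3 or _pxhd count as _px.
            if cookie.startswith(prefix):
                found[vendor] = "high"
    if challenge_shaped:
        # Body markers are consulted only on a challenge-shaped response,
        # and never override a header or cookie match.
        for marker, vendor in BODY_MARKERS.items():
            if marker in body:
                found.setdefault(vendor, "medium")
    return found
```

Keeping the confidence next to each vendor lets a header or cookie hit outrank a body-only hit when both point at the same vendor.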
CAPTCHA widgets are typed and version-classified from script sources and markup — reCAPTCHA v2/v3/Enterprise, hCaptcha, Cloudflare Turnstile, Arkose FunCaptcha, GeeTest v3/v4, AWS WAF and others. Most sites only show a CAPTCHA on login/checkout, so homepage CAPTCHA counts are a floor.
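One way to type widgets from script sources is an ordered list of patterns per CAPTCHA family, where the most specific pattern wins. The sketch below covers only a subset of the types listed above, and the patterns are assumptions based on the commonly seen public script URLs for each widget, not the index's actual rule set.

```python
import re

# (family, label, pattern), ordered from most to least specific within a family.
CAPTCHA_RULES = [
    ("recaptcha", "reCAPTCHA Enterprise", re.compile(r"recaptcha/enterprise\.js")),
    # render=<site key> is taken to mean v3; render=explicit is a v2 loading mode.
    ("recaptcha", "reCAPTCHA v3",
     re.compile(r"recaptcha/api\.js\?[^\"'>\s]*render=(?!explicit)")),
    ("recaptcha", "reCAPTCHA v2", re.compile(r"recaptcha/api\.js")),
    ("hcaptcha", "hCaptcha", re.compile(r"hcaptcha\.com/1/api\.js")),
    ("turnstile", "Cloudflare Turnstile",
     re.compile(r"challenges\.cloudflare\.com/turnstile/")),
]


def classify_captchas(html):
    """Return the sorted CAPTCHA labels found in html, one label per family."""
    by_family = {}
    for family, label, pattern in CAPTCHA_RULES:
        if family not in by_family and pattern.search(html):
            by_family[family] = label
    return sorted(by_family.values())
```

Because only the homepage HTML is inspected, an empty result says nothing about login or checkout pages.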
Proprietary signed-payload VMs (TikTok’s webmssdk/X-Bogus, Kasada, F5/Shape) are flagged as a distinct “closed VM” class.
Vendors detected in this run
CAPTCHA types surfaced
We map the strongest detected protection to the typical tier of tooling generally required to reliably access public pages, bumped one tier when we actually saw a challenge. The tier is derived from headers and HTML, not from a live multi-transport measurement, so read it as directional.
T1: Plain HTTP client (band: Easy)
No managed anti-bot detected; a plain HTTP request reaches it.
T2: Browser-impersonation HTTP (band: Medium)
Wants a matched TLS/JA3-JA4 fingerprint, correct HTTP/2 frame order, and realistic headers. Akamai and open Cloudflare paths live here.
T3: Headless browser (JS) (band: Hard)
Needs a real browser to execute the vendor's JavaScript challenge: Cloudflare managed challenge, Imperva, AWS WAF challenge, most CAPTCHA gates.
T4: Stealth browser + residential IP + behavior (band: Very hard)
Weighs behavior and IP reputation on top of JS: DataDome and PerimeterX/HUMAN.
T5: Closed signed-payload VM (band: Closed VM)
Signs every request with a proprietary in-browser bytecode VM: TikTok (webmssdk), Kasada, F5/Shape. Generic transport tooling can't mint valid tokens.
A CAPTCHA gate lifts an otherwise-open site to at least T3; a detected closed-VM defense sets T5. The 1–10 difficulty score nudges up for a CAPTCHA or a hard block. Measuring difficulty with the real engine (running the actual transport fleet against the URL) is a paid check that Crawlora’s anti-bot checker performs for a single URL.
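The tier rules above can be sketched as a lookup followed by three adjustments. The base tier assigned to each vendor is one reading of the tier descriptions, and capping the challenge bump at the fourth tier (so that only a closed VM reaches the fifth) is an assumption; the function and table names are hypothetical.

```python
# Assumed base tiers, read off the tier descriptions.
VENDOR_BASE_TIER = {
    "Akamai": 2,
    "Cloudflare": 2,
    "Imperva": 3,
    "AWS WAF": 3,
    "DataDome": 4,
    "PerimeterX/HUMAN": 4,
    "TikTok webmssdk": 5,
    "Kasada": 5,
    "F5/Shape": 5,
}
TIER_BAND = {1: "Easy", 2: "Medium", 3: "Hard", 4: "Very hard", 5: "Closed VM"}


def access_tier(vendors, saw_challenge=False, has_captcha=False):
    """Return (tier, band) for the strongest detected protection."""
    # Unknown vendor names fall back to tier 1 in this sketch.
    tier = max((VENDOR_BASE_TIER.get(v, 1) for v in vendors), default=1)
    if tier < 5:  # a closed-VM defense always stays at the top tier
        if saw_challenge:
            # Bump one tier for an observed challenge (assumed cap at tier 4).
            tier = min(tier + 1, 4)
        if has_captcha:
            # A CAPTCHA gate lifts the site to at least tier 3.
            tier = max(tier, 3)
    return tier, TIER_BAND[tier]
```

For example, access_tier(["Cloudflare"], saw_challenge=True) moves an open Cloudflare path from the second tier to the third, which matches where the text places the Cloudflare managed challenge.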
When a request doesn’t pass, we read why. From the status code, response headers and cookies we separate a rate limit (429 / Retry-After), an IP ban (Cloudflare 1006–1008), a bot challenge (cf-mitigated), a CAPTCHA, a geo-block (451/1009), and a login wall (401 / a redirect to /login). The fix differs each time (rotate IPs for a rate limit, use a real browser for a challenge), so a login wall is treated as its own class rather than as a difficulty level. Per-site pages add an advisory deep-page test plan, because homepage protection is a poor proxy for the profile, search and checkout pages you actually scrape.
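The block-reason split above could be approximated with a short decision chain. The signals are the ones named in the text; the order of the checks, the regular expression used to find a Cloudflare error code in the body, the has_captcha flag and the returned labels are all assumptions of this sketch.

```python
import re
from urllib.parse import urlparse

# Assumed shape of a Cloudflare error code in the body, e.g. "error code: 1006".
CF_ERROR = re.compile(r"error(?:\s+code)?:?\s*(10\d{2})", re.IGNORECASE)


def classify_block(status, headers, body="", final_url="", has_captcha=False):
    """Return a coarse reason label for a response that may not have passed."""
    names = {name.lower() for name in headers}
    match = CF_ERROR.search(body)
    cf_code = int(match.group(1)) if match else None

    if status == 429 or "retry-after" in names:
        return "rate_limit"
    if cf_code in (1006, 1007, 1008):
        return "ip_ban"
    if status == 451 or cf_code == 1009:
        return "geo_block"
    if "cf-mitigated" in names:
        return "bot_challenge"
    if has_captcha:
        return "captcha"
    # After redirects are followed, a login wall shows up as the final URL path.
    if status == 401 or urlparse(final_url).path.startswith("/login"):
        return "login_wall"
    return "passed" if status < 400 else "unclassified"
```

Returning a label rather than a score keeps the remedy visible: a rate limit and a login wall call for different fixes even when both arrive as a refusal.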
This check can be inaccurate or out of date
Anti-bot protection is deliberately dynamic, so a snapshot like this can be wrong in both directions, and the vendors update their models constantly, often daily.
Snapshot 2026-06-12. Licensed CC BY 4.0 — cite as “Crawlora Anti-Bot Adoption Index” with a link.