Data study · June 2026
The headline bot reports measure how much traffic is automated. We measured the other side of the ledger: the walls. Across the top 1 million and top 10 million sites, 53.5% of reachable sites now run a managed anti-bot wall, one in eleven blocks an AI crawler, and 14.1% are already dead.
The web’s defensive posture, June 2026
53.5%
of reachable top-1M sites run a managed anti-bot / WAF wall
9.33%
fully block at least one major AI crawler in robots.txt
14.1%
of the top 10M are genuinely dead — not the often-cited 27.6%
Independent datacenter-IP scans of the Tranco top 1M and top 10M (June 2026). Every figure is a conservative lower bound.
Each study answers a different question about access to public web data. Together they describe a web that is harder to read by machine every year — and explain why reliable collection now depends on handling walls, not just fetching pages.
53.5%
The web is walling up
A managed anti-bot / WAF wall now fronts most reachable sites — and Cloudflare alone is about 84% of them. The walls get denser down the ranking, not thinner.
9.33%
AI is gated selectively
Few sites block AI crawlers outright, and those that do target training (GPTBot ≈ CCBot) far more than the agents that send users back. Blocking is concentrated at the head and led by news.
14.1%
The tail is decaying
One in seven top-10M domains is genuinely gone, mostly because DNS no longer resolves (76% of dead domains), and almost all of it sits below rank 100,000. Weighted by traffic, the dead web is closer to 3%.
Of 998,497 sites in the Tranco top 1M, 818,614 were reachable and 437,857 of those front a managed anti-bot or WAF. Cloudflare is about 45% of the reachable web and ~84% of every protected site. The surprise is the gradient: protection and Cloudflare rise as you descend the ranking, while enterprise bot-management (DataDome, Akamai, PerimeterX) does the reverse — the head and the tail are defended by different vendors.
| Rank band | Reachable | Run a wall | Cloudflare | Enterprise bot-mgmt |
|---|---|---|---|---|
| Top 1K | 74.0% | 44.2% | 23.4% | 7.58% |
| 1K–10K | 81.1% | 50.7% | 34.0% | 5.29% |
| 10K–100K | 81.7% | 52.3% | 40.2% | 2.55% |
| 100K–1M | 82.0% | 53.6% | 45.6% | 0.60% |
Full vendor leaderboard, per-category difficulty, and the live numbers: the Anti-Bot Adoption Index (open dataset, CC BY 4.0).
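The headline shares can be rederived from the raw counts quoted above. The following is a small arithmetic sketch, not the study's pipeline; the variable names are illustrative, and the 45% Cloudflare figure is the rounded value from the text, so the last result is only approximate.

```python
# Counts quoted in the text for the Tranco top 1M scan.
scanned = 998_497
reachable = 818_614
walled = 437_857

reachable_share = reachable / scanned  # share of scanned sites that answered
wall_share = walled / reachable        # share of reachable sites behind a wall

# Cloudflare is quoted as about 45% of the reachable web (rounded figure).
# Dividing by the wall share gives its approximate share of protected sites.
cloudflare_of_reachable = 0.45
cloudflare_of_walled = cloudflare_of_reachable / wall_share

print(f"{reachable_share:.1%} of scanned sites were reachable")
print(f"{wall_share:.1%} of reachable sites run a wall")
print(f"{cloudflare_of_walled:.0%} of walled sites are on Cloudflare")
```

Note that the denominator matters: the 53.5% figure is a share of reachable sites, not of all scanned sites.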
Across the same top 1M, 9.33% of sites fully block at least one major AI crawler in robots.txt (14.8% of the sites that publish a robots.txt at all). OpenAI's GPTBot leads at 7.4%, but Common Crawl's CCBot is a near-tie at 7.23%. The pattern is deliberate: sites block the crawlers that train models far harder than the agents that return referral traffic — and unlike anti-bot walls, AI-blocking is concentrated among the busiest, most content-heavy sites (news & media lead at ~81%).
| Crawler | Purpose | Sites fully blocking |
|---|---|---|
| GPTBot (OpenAI) | Training | 7.40% |
| CCBot (Common Crawl) | Training | 7.23% |
| Bytespider (ByteDance) | Training | 6.77% |
| ClaudeBot (Anthropic) | Training | 6.69% |
| Google-Extended | Training | 6.28% |
| ChatGPT-User (OpenAI) | AI assistant | 1.59% |
| PerplexityBot | AI search | 1.29% |
| OAI-SearchBot (OpenAI) | AI search | 0.68% |
Per-crawler, per-category and per-TLD breakdowns, plus the open dataset: the AI-Crawler Blocking Index.
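To make "fully block" concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes that "fully blocks" means the site root is disallowed for that user-agent; the study's exact rule is not stated here, and the `fully_blocks` helper and the `SAMPLE` file are hypothetical. Like the index itself, this reads declared policy only and says nothing about enforcement.

```python
from urllib.robotparser import RobotFileParser


def fully_blocks(robots_txt: str, user_agent: str) -> bool:
    """Return True if robots.txt disallows the site root for user_agent.

    Assumption: a root-level disallow is treated as a full block.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# Hypothetical robots.txt that gates training crawlers but nothing else.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

for agent in ("GPTBot", "CCBot", "PerplexityBot"):
    print(agent, fully_blocks(SAMPLE, agent))
```

In this sample the two training crawlers are blocked while PerplexityBot falls through to the wildcard rule and is allowed, which mirrors the selective pattern in the table.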
We probed all 10 million most-popular domains (9.99M reached, 99.95% coverage) both politely and with a real Chrome fingerprint. The genuinely dead share is 14.1%, about half the figure naive crawls report, because a 403, a 429, or a served 404 is not death; those sites are alive and either blocking or misconfigured. Real death is DNS: 76% of dead domains no longer resolve. It is also a tail phenomenon: 99.8% of dead domains sit below rank 100,000, so weighted by traffic the dead web is closer to 3%.
Outcome of the top 10M
| Outcome | Share | What it means |
|---|---|---|
| Alive | 76.6% | responds with usable content |
| Dead | 14.1% | DNS gone or no server — 76% no longer resolve at all |
| Blocked | 8.9% | answers but blocks bots (403 / 429 / anti-bot) |
| Redirect | 0.3% | redirects off the original domain |
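The gap between a naive dead count and the figure above comes down to how a probe result is bucketed. The sketch below illustrates that distinction under stated assumptions: the function names, the `misconfigured` label, and the exact status-code rules are hypothetical, and the study's own table does not say which bucket served 404/5xx responses fall into.

```python
from typing import Optional


def naive_is_dead(dns_resolved: bool, status: Optional[int]) -> bool:
    """Naive rule: any DNS failure, missing response, or 4xx/5xx counts as dead."""
    return (not dns_resolved) or status is None or status >= 400


def classify(dns_resolved: bool, status: Optional[int],
             off_domain_redirect: bool = False) -> str:
    """Stricter rule: only DNS failure or no server response counts as dead."""
    if not dns_resolved or status is None:
        return "dead"            # DNS gone or no server answered
    if off_domain_redirect:
        return "redirect"        # redirects off the original domain
    if status in (403, 429):
        return "blocked"         # alive, but refusing bots
    if 200 <= status < 300:
        return "alive"           # responds with usable content
    return "misconfigured"       # served 404/5xx: a server exists (assumed label)


# Hypothetical probe results: (dns_resolved, http_status)
probes = [(True, 200), (True, 403), (True, 429), (True, 404), (False, None)]
for dns_ok, status in probes:
    print(status, naive_is_dead(dns_ok, status), classify(dns_ok, status))
```

Of the five sample probes, the naive rule marks four as dead while the stricter rule marks only the DNS failure, which is the kind of overcount the text attributes to naive crawls.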
Dead rate by country (ccTLD)
| Country | Dead |
|---|---|
| China (.cn) | 33.0% |
| India (.in) | 25.9% |
| United States (.us) | 22.0% |
| Brazil (.br) | 20.9% |
| Spain (.es) | 16.6% |
| Japan (.jp) | 15.6% |
| United Kingdom (.uk) | 15.3% |
| France (.fr) | 14.5% |
| Germany (.de) | 7.6% |
The full reachability funnel, redirect analysis and open dataset: the Dead-Web Index.
Three independent scans run in June 2026 from datacenter IPs, without residential proxies. That makes every figure a conservative lower bound: a site we record as walled is at least that walled, and one we record as reachable cleared a low bar.
robots.txt policy across 20 AI user-agents. It measures declared intent, not enforcement or traffic.
In an independent June 2026 scan of the Tranco top 1 million sites (998,497 scanned, 818,614 reachable), 53.5% of reachable sites run a managed anti-bot or WAF wall. Cloudflare alone accounts for about 84% of every protected site. Counter-intuitively, protection rises as you go down the ranking: 44.2% of the top 1,000 versus 53.6% of the 100K–1M band.
GPTBot (OpenAI) is the single most-blocked at 7.4% of all top-1M sites, with Common Crawl's CCBot a near-tie at 7.23% — the open training corpus is blocked almost as hard as OpenAI's own crawler. Overall only 9.33% of sites block at least one major AI crawler. Training crawlers are blocked roughly 5x more than the AI-search and assistant agents (PerplexityBot, ChatGPT-User, OAI-SearchBot) that send referral traffic back.
14.1% of the top 10 million domains are genuinely dead — far below the widely-cited 27.6% from earlier naive crawls. The difference is method: a naive crawl counts a 403, a 429, or a served 404/5xx as 'dead', but those domains are alive and merely blocking bots (8.9%) or misconfigured. True death is overwhelmingly DNS failure — 76% of dead domains no longer resolve at all.
Those reports measure bot TRAFFIC volume — what share of requests come from bots (around 53% of all traffic). This report measures the opposite side of the ledger: the web's defensive POSTURE and reachability, site by site, across the top 1M–10M domains. It answers 'how much of the web is walled, AI-gated, or gone?' rather than 'how much traffic is automated?'
Yes — all three datasets are published under CC BY 4.0 and explorable at the live indexes. Every scan ran from datacenter IPs without residential proxies, so each figure is a conservative lower bound. Anti-bot adoption is homepage HTTP fingerprinting; AI-crawler blocking parses stated robots.txt policy (not enforcement); dead-web reachability probes both politely and with a real Chrome fingerprint, counting only genuinely unreachable domains as dead.
More than half the web now sits behind a managed wall, and the share keeps climbing. Crawlora returns structured data from documented endpoints across search, maps, commerce, social and finance — with pay-on-success billing, so you only pay for the calls that get through.