Data study · updated June 20, 2026
We fetched the robots.txt of every site in the Tranco top 1,000,000 and recorded which AI crawlers each site disallows. The headline: almost nobody blocks them, and the blocking that exists is concentrated among the busiest sites, fading into the long tail. GPTBot leads, but Common Crawl’s CCBot is right behind it.
9.33%
of the 998,497 sites fully block at least one major AI crawler in robots.txt — just 14.8% of those that publish one.
7.4%
block GPTBot — the single most-blocked AI crawler
7.19%
block every bot via * / Disallow: /
Independent robots.txt scan of the Tranco top 1,000,000 (June 20, 2026). “Fully blocks” = the crawler’s user-agent named with Disallow: /.
Share of the Tranco top 1,000,000 that fully block each AI crawler (Disallow: /). The training crawlers (red) cluster at the top; the AI-search / assistant agents that send traffic back (blue) are blocked far less.
GPTBot (7.4%) and CCBot (7.23%) are neck-and-neck at the top. Blocking GPTBot but not CCBot is leaving the front door open: the Common Crawl corpus is what most models train on indirectly.
The training crawlers — GPTBot, CCBot, ClaudeBot, Google-Extended, Bytespider — sit at ~6.9% on average. The search / assistant agents that return clicks — PerplexityBot, ChatGPT-User, OAI-SearchBot — are blocked on only ~1.1%.
“Fully blocks” counts only a Disallow: / under the crawler’s own user-agent. Many more sites name a crawler with a narrower rule; see the table below.
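To make the GPTBot-versus-CCBot point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The two robots.txt bodies and the `allowed` helper are illustrative assumptions, not files from the scan; the sketch only shows that a rule naming GPTBot alone says nothing to CCBot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, written for illustration only.
GPTBOT_ONLY = """\
User-agent: GPTBot
Disallow: /
"""

BOTH = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str,
            url: str = "https://example.com/") -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for name, body in (("GPTBot only", GPTBOT_ONLY), ("both", BOTH)):
        for bot in ("GPTBot", "CCBot"):
            print(f"{name:12} {bot:7} allowed={allowed(body, bot)}")
```

With the first file, GPTBot is refused while CCBot is still allowed; listing both user-agents in one group closes that gap, at least as a stated policy.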
Share of each Tranco rank band that fully blocks at least one AI crawler. It peaks near the head of the ranking (12.9% of the top 1,000 and 13.52% of ranks 1,001-10,000) and fades into the long tail (9.13% of ranks 100,001-1,000,000). That is the inverse of managed anti-bot/WAF adoption, which climbs as you go down the ranking. The sites most worth scraping are the ones that bothered to add the line.
Among major sites in each vertical the split is sharp: news, media, reference and community sites — whose archive is the asset — block AI crawlers heavily, while transactional and functional sites (commerce, finance, SaaS, government, search) barely bother. The categories that most want to be cited block the least.
News & media leads at 80.6% and blocks CCBot (66.7%) harder than GPTBot (47.2%): publishers defend the Common Crawl training corpus first.
Social platforms score high too (33.3%), but mostly via a blanket User-agent: * / Disallow: / wall (30.6%): they are login-gated to protect user data, and AI crawlers are caught as collateral. News sites write AI-specific rules instead.
Search engines (2.9%), developer sites (5.6%) and government sites (8.3%) sit near the bottom of the table: their whole point is to be indexed and cited.
Rates are for the ~36 major sites in each vertical, so they run higher than the 9.33% whole-1M average. The ranking across categories is the robust signal.
| Category | Block ≥1 AI (%) | GPTBot % | CCBot % | Blocks all (*) % |
|---|---|---|---|---|
| News & media | 80.6% | 47.2% | 66.7% | 8.3% |
| Entertainment | 69.4% | 50% | 61.1% | 2.8% |
| Reference & wiki | 44.4% | 27.8% | 27.8% | 5.6% |
| Sports | 41.7% | 30.6% | 30.6% | 2.8% |
| Social media | 33.3% | 30.6% | 13.9% | 30.6% |
| Healthcare | 33.3% | 16.7% | 30.6% | 0% |
| Forums & community | 33.3% | 30.6% | 22.2% | 8.3% |
| Food & delivery | 30.6% | 11.1% | 25% | 5.6% |
| Classifieds | 27.8% | 13.9% | 27.8% | 8.3% |
| Gaming | 27.8% | 13.9% | 16.7% | 0% |
| Music & audio | 25.7% | 22.9% | 25.7% | 5.7% |
| Education | 25% | 19.4% | 13.9% | 2.8% |
| Other | 25% | 13.9% | 19.4% | 2.8% |
| Marketplaces | 22.2% | 19.4% | 13.9% | 2.8% |
| Video & streaming | 19.4% | 13.9% | 11.1% | 8.3% |
| Real estate | 19.4% | 8.3% | 2.8% | 0% |
| Jobs & recruiting | 13.9% | 8.3% | 5.6% | 5.6% |
| Automotive | 13.9% | 8.3% | 8.3% | 0% |
| E-commerce | 11.1% | 11.1% | 5.6% | 0% |
| Crypto | 11.1% | 5.6% | 5.6% | 0% |
| SaaS & business | 11.1% | 5.6% | 8.3% | 0% |
| AI tools | 11.1% | 2.8% | 2.8% | 5.6% |
| Travel & hospitality | 8.3% | 2.8% | 2.8% | 0% |
| Finance & markets | 8.3% | 0% | 2.8% | 2.8% |
| Government | 8.3% | 2.8% | 2.8% | 0% |
| Developer & tech | 5.6% | 2.8% | 5.6% | 0% |
| Search engines | 2.9% | 2.9% | 2.9% | 17.6% |
| Telecom | 2.8% | 0% | 2.8% | 0% |
5 dated, citable findings from the Tranco top 1,000,000 robots.txt scan.
9.33%
fully block at least one major AI crawler (14.8% of the 63.05% that publish a robots.txt). The overwhelming majority of the web leaves the AI crawlers a clear path.
7.4% ≈ 7.23%
GPTBot ≈ CCBot. OpenAI’s crawler is the most-blocked, but Common Crawl’s CCBot — the open corpus most models train on indirectly — is blocked almost exactly as hard.
9.13–13.52%
Concentrated at the head. 12.9% of the top 1,000 and 13.52% of ranks 1,001-10,000 block an AI crawler, fading to 9.13% across the long tail. That is the inverse of anti-bot/WAF adoption, which climbs steadily down the ranking.
6.9% vs 1.1%
Train on me ≠ send me users. Training crawlers are blocked ~6.3× more than the AI search / assistant agents that return referral traffic.
7.19%
slam the door on every bot with a blanket User-agent: * / Disallow: / — overwhelmingly the web’s parked, expired and placeholder domains, which shut out AI crawlers as collateral.
Same universe as the Anti-Bot Adoption Index, so you can join “does it run a WAF?” against “does it block AI in robots.txt?”
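A join like that could look roughly as follows. This is a sketch under assumptions: the two dictionaries stand in for per-domain exports of each dataset, and the domains, field meanings and the `cross_tab` helper are hypothetical placeholders rather than the published schema.

```python
from collections import Counter

# Hypothetical per-domain flags; real exports may use different keys and shapes.
runs_waf = {"example.com": True, "example.org": False, "example.net": True}
blocks_ai = {"example.com": True, "example.org": True, "example.net": False}


def cross_tab(waf: dict, ai: dict) -> Counter:
    """Count domains by (runs_waf, blocks_ai), using only domains in both sets."""
    counts = Counter()
    for domain in waf.keys() & ai.keys():
        counts[(waf[domain], ai[domain])] += 1
    return counts


if __name__ == "__main__":
    for (waf_flag, ai_flag), n in sorted(cross_tab(runs_waf, blocks_ai).items()):
        print(f"WAF={waf_flag!s:5} blocks_AI={ai_flag!s:5} sites={n}")
```

Keying on the domain and taking the intersection keeps the comparison honest when one dataset is missing a site the other has.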
The same figures behind the charts, as plain HTML tables — easy to copy, and machine-readable for search engines and AI answer engines that can't parse a chart.
| AI crawler | Operator | Purpose | Fully blocks (% of all) | Named (% of all) |
|---|---|---|---|---|
| GPTBot (OpenAI) | OpenAI | AI training | 7.4% | 8.91% |
| CCBot (Common Crawl) | Common Crawl | AI training | 7.23% | 12.5% |
| Bytespider (ByteDance) | ByteDance | AI training | 6.77% | 11.81% |
| ClaudeBot (Anthropic) | Anthropic | AI training | 6.69% | 7.95% |
| Amazonbot | Amazon | Search/assistant | 6.6% | 7.31% |
| Google-Extended | Google | AI training | 6.28% | 7.4% |
| Meta-ExternalAgent | Meta | AI training | 6.02% | 6.96% |
| Applebot-Extended | Apple | AI training | 5.93% | 6.54% |
| PetalBot | Huawei (Petal) | Search | 1.61% | 6.47% |
| ChatGPT-User | OpenAI | AI assistant | 1.59% | 2.71% |
| anthropic-ai | Anthropic | AI training | 1.51% | 2.23% |
| PerplexityBot | Perplexity | AI search | 1.29% | 2.49% |
| cohere-ai | Cohere | AI training | 1.18% | 1.64% |
| Claude-Web | Anthropic | AI assistant | 1.18% | 1.66% |
| ImageSiftBot | ImageSift / Hive | AI training | 1.15% | 1.3% |
| Omgilibot | Webz.io | AI training | 1.13% | 1.4% |
| YouBot | You.com | AI search | 0.97% | 1.41% |
| FacebookBot | Meta | Crawler | 0.95% | 1.48% |
| Diffbot | Diffbot | Crawler | 0.92% | 1.15% |
| OAI-SearchBot | OpenAI | AI search | 0.68% | 1.65% |
| Rank band | Sites | Block ≥1 AI (% of all) | % of robots-serving |
|---|---|---|---|
| 1-1000 | 1,000 | 12.9% | 22.51% |
| 1001-10000 | 9,000 | 13.52% | 20.9% |
| 10001-100000 | 90,000 | 10.9% | 17.07% |
| 100001-1000000 | 898,497 | 9.13% | 14.5% |
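As a consistency check on the rank-band table, the bands can be recombined into a site-weighted share. The sketch below uses only the figures printed in the table; the names `BANDS`, `weighted_share` and `implied_robots_share` are introduced here for illustration.

```python
# (band, sites, % of all sites blocking >=1 AI crawler, % of robots-serving sites)
BANDS = [
    ("1-1000", 1_000, 12.9, 22.51),
    ("1001-10000", 9_000, 13.52, 20.9),
    ("10001-100000", 90_000, 10.9, 17.07),
    ("100001-1000000", 898_497, 9.13, 14.5),
]


def weighted_share(bands) -> float:
    """Site-weighted percentage of sites that block at least one AI crawler."""
    total_sites = sum(sites for _, sites, _, _ in bands)
    blocking = sum(sites * pct / 100 for _, sites, pct, _ in bands)
    return 100 * blocking / total_sites


def implied_robots_share(pct_of_all: float, pct_of_serving: float) -> float:
    """Share of sites serving a robots.txt implied by the two percentages."""
    return 100 * pct_of_all / pct_of_serving


if __name__ == "__main__":
    print(f"all bands:  {weighted_share(BANDS):.2f}%")
    print(f"top 10,000: {weighted_share(BANDS[:2]):.2f}%")
    for band, _, pct_all, pct_serving in BANDS:
        share = implied_robots_share(pct_all, pct_serving)
        print(f"{band}: about {share:.1f}% serve a robots.txt")
```

The all-band figure comes out at 9.33%, matching the headline number, and the first two bands combine to about 13.46% for the top 10,000 as a whole.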
We fetched /robots.txt for each of the Tranco top 1,000,000 (same domain universe as the Anti-Bot Adoption Index), from a datacenter IP, and parsed every User-agent group.
A site “fully blocks” a crawler when its robots.txt names that user-agent (exact, case-insensitive) in a group with Disallow: /. That’s the strict, unambiguous “kept off the whole site” signal — it deliberately excludes partial paths (Disallow: /private) and allow-only mentions, which is why our numbers are lower than studies that count any mention of a bot.
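The rule above can be sketched in a few lines of Python. This is an illustrative reading of the definition, not the scanner used for the study: the `parse_groups` and `classify` helpers and the sample file are assumptions, and edge cases such as a group holding both Allow: / and Disallow: / are handled in the simplest way.

```python
def parse_groups(text: str):
    """Split a robots.txt body into (user_agents, rules) groups."""
    groups = []
    agents, rules, seen_rule = [], [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a rule line closed the previous group
                groups.append((agents, rules))
                agents, rules, seen_rule = [], [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
            seen_rule = True
    if agents:
        groups.append((agents, rules))
    return groups


def classify(text: str, bot: str) -> str:
    """Return 'fully_blocks', 'named', or 'not_named' for one user-agent."""
    bot = bot.lower()  # exact, case-insensitive match
    named = False
    for agents, rules in parse_groups(text):
        if bot in agents:
            named = True
            if ("disallow", "/") in rules:
                return "fully_blocks"
    return "named" if named else "not_named"


SAMPLE = """\
# hypothetical robots.txt
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private

User-agent: PerplexityBot
Allow: /
"""

if __name__ == "__main__":
    for bot in ("GPTBot", "CCBot", "PerplexityBot", "ClaudeBot"):
        print(bot, classify(SAMPLE, bot))
```

In the sample, only GPTBot counts as fully blocked; CCBot (partial path) and PerplexityBot (allow-only) land in the looser "named" column, and ClaudeBot is not mentioned at all.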
robots.txt is a published request, not an enforced wall — a crawler can ignore it. This measures stated policy, not traffic. We track 20 AI user-agents across OpenAI, Anthropic, Google, Common Crawl, ByteDance, Meta, Amazon, Apple, Perplexity, Cohere and others.