Crawlora

Structured public web data APIs for search, maps, geocoding, streaming, travel, real estate, marketplaces, apps, social, audio, crypto, finance, and AI workflows with managed execution and credit-based usage.

© 2026 Crawlora. All rights reserved.·Built by Tony Wang

Data study · updated June 20, 2026

Who actually blocks the AI crawlers?

We fetched the robots.txt of the Tranco top 1,000,000 and recorded which AI crawlers each site disallows. The headline: almost nobody blocks them, and blocking is concentrated among the busiest sites, fading into the long tail. GPTBot leads, but Common Crawl’s CCBot is right behind it.

9.33%

of the 998,497 sites fully block at least one major AI crawler in robots.txt — just 14.8% of those that publish one.

7.4%

block GPTBot — the single most-blocked AI crawler

7.19%

block every bot via * / Disallow: /

Independent robots.txt scan of the Tranco top 1,000,000 (June 20, 2026). “Fully blocks” = the crawler’s user-agent named with Disallow: /.

The leaderboard

GPTBot leads — but the open corpus is blocked just as hard.

Share of the Tranco top 1,000,000 that fully block each AI crawler (Disallow: /). The training crawlers (red) cluster at the top; the AI-search / assistant agents that send traffic back (blue) are blocked far less.

| Crawler | Fully blocks (% of all) | Operator |
| --- | --- | --- |
| GPTBot (OpenAI) | 7.4% | OpenAI |
| CCBot (Common Crawl) | 7.23% | Common Crawl |
| Bytespider (ByteDance) | 6.77% | ByteDance |
| ClaudeBot (Anthropic) | 6.69% | Anthropic |
| Amazonbot | 6.6% | Amazon |
| Google-Extended | 6.28% | Google |
| Meta-ExternalAgent | 6.02% | Meta |
| Applebot-Extended | 5.93% | Apple |
| PetalBot | 1.61% | Huawei (Petal) |
| ChatGPT-User | 1.59% | OpenAI |
| anthropic-ai | 1.51% | Anthropic |
| PerplexityBot | 1.29% | Perplexity |

% of all scanned sites that fully block each crawler. Red = AI training crawler, blue = AI search / assistant.

GPTBot (7.4%) and CCBot (7.23%) are neck-and-neck at the top. Blocking GPTBot but not CCBot is leaving the front door open: the Common Crawl corpus is what most models train on indirectly.

The training crawlers — GPTBot, CCBot, ClaudeBot, Google-Extended, Bytespider — sit at ~6.9% on average. The search / assistant agents that return clicks — PerplexityBot, ChatGPT-User, OAI-SearchBot — are blocked on only ~1.1%.
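The two averages and the ratio between them can be checked against the per-crawler figures with a few lines of Python. This is a minimal sketch, not the study's own calculation: the training group is the five crawlers named above, while the search / assistant group is assumed to be the five rows labelled AI search or AI assistant in the per-crawler reference table. That grouping is an assumption made here because it is the one that reproduces the ~1.1% figure.

```python
from statistics import mean

# "Fully blocks" percentages copied from the per-crawler table.
fully_blocks = {
    "GPTBot": 7.4, "CCBot": 7.23, "ClaudeBot": 6.69,
    "Google-Extended": 6.28, "Bytespider": 6.77,
    "ChatGPT-User": 1.59, "PerplexityBot": 1.29, "Claude-Web": 1.18,
    "YouBot": 0.97, "OAI-SearchBot": 0.68,
}

# The five training crawlers named in the text.
training = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended", "Bytespider"]

# Assumption: every row labelled AI search / AI assistant in the table.
assistants = ["ChatGPT-User", "PerplexityBot", "Claude-Web", "YouBot", "OAI-SearchBot"]

training_avg = round(mean(fully_blocks[bot] for bot in training), 1)
assistant_avg = round(mean(fully_blocks[bot] for bot in assistants), 1)
ratio = round(training_avg / assistant_avg, 1)

print(training_avg, assistant_avg, ratio)  # 6.9 1.1 6.3
```

The ratio is taken from the rounded averages, which is how a 6.9% versus 1.1% comparison yields roughly 6.3×.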

“Fully blocks” counts only a Disallow: / under the crawler’s own user-agent. Many more sites name a crawler with a narrower rule; see the table below.

By traffic rank

AI-blocking is concentrated at the top of the web.

Share of each Tranco rank band that fully blocks at least one AI crawler. It peaks among the most-trafficked sites (13.52% of sites ranked 1,001 to 10,000, and 12.9% of the top 1,000) and fades into the long tail (9.13%). That is the inverse of managed anti-bot/WAF adoption, which climbs as you go down the ranking. The sites most worth scraping are the ones that bothered to add the line.

| Rank band | Block ≥1 AI crawler (% of all) | % of robots-serving |
| --- | --- | --- |
| 1-1000 | 12.9% | 22.51% |
| 1001-10000 | 13.52% | 20.9% |
| 10001-100000 | 10.9% | 17.07% |
| 100001-1000000 | 9.13% | 14.5% |

% of sites in each Tranco rank band that fully block at least one AI crawler.

Who blocks — by category

The web that lives on its content blocks the most.

Among major sites in each vertical the split is sharp: news, media, reference and community sites — whose archive is the asset — block AI crawlers heavily, while transactional and functional sites (commerce, finance, SaaS, government, search) barely bother. The categories that most want to be cited block the least.

[Chart: % of major sites in each vertical (~36 each) that fully block ≥1 AI crawler, ranging from News & media (80.6%) to Telecom (2.8%). Red = content / media / community; blue = transactional / functional. Per-category values are listed in the full table below.]

News & media leads at 80.6% and blocks CCBot (66.7%) harder than GPTBot (47.2%): publishers defend the Common Crawl training corpus first.

Social platforms score high too (33.3%), but mostly through a blanket * / Disallow: / wall, which 30.6% of them use: they are login-gated to protect user data, and AI crawlers are caught as collateral. News sites write AI-specific rules instead.

Search engines (2.9%), government and developer sites block the least — their whole point is to be indexed and cited.

Rates are for the ~36 major sites in each vertical, so they run higher than the 9.33% whole-1M average. The ranking across categories is the robust signal.

Blocking by category — full table
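
| Category | Block ≥1 AI (%) | GPTBot % | CCBot % | Blocks all (*) % |
| --- | --- | --- | --- | --- |
| News & media | 80.6% | 47.2% | 66.7% | 8.3% |
| Entertainment | 69.4% | 50% | 61.1% | 2.8% |
| Reference & wiki | 44.4% | 27.8% | 27.8% | 5.6% |
| Sports | 41.7% | 30.6% | 30.6% | 2.8% |
| Social media | 33.3% | 30.6% | 13.9% | 30.6% |
| Healthcare | 33.3% | 16.7% | 30.6% | 0% |
| Forums & community | 33.3% | 30.6% | 22.2% | 8.3% |
| Food & delivery | 30.6% | 11.1% | 25% | 5.6% |
| Classifieds | 27.8% | 13.9% | 27.8% | 8.3% |
| Gaming | 27.8% | 13.9% | 16.7% | 0% |
| Music & audio | 25.7% | 22.9% | 25.7% | 5.7% |
| Education | 25% | 19.4% | 13.9% | 2.8% |
| Other | 25% | 13.9% | 19.4% | 2.8% |
| Marketplaces | 22.2% | 19.4% | 13.9% | 2.8% |
| Video & streaming | 19.4% | 13.9% | 11.1% | 8.3% |
| Real estate | 19.4% | 8.3% | 2.8% | 0% |
| Jobs & recruiting | 13.9% | 8.3% | 5.6% | 5.6% |
| Automotive | 13.9% | 8.3% | 8.3% | 0% |
| E-commerce | 11.1% | 11.1% | 5.6% | 0% |
| Crypto | 11.1% | 5.6% | 5.6% | 0% |
| SaaS & business | 11.1% | 5.6% | 8.3% | 0% |
| AI tools | 11.1% | 2.8% | 2.8% | 5.6% |
| Travel & hospitality | 8.3% | 2.8% | 2.8% | 0% |
| Finance & markets | 8.3% | 0% | 2.8% | 2.8% |
| Government | 8.3% | 2.8% | 2.8% | 0% |
| Developer & tech | 5.6% | 2.8% | 5.6% | 0% |
| Search engines | 2.9% | 2.9% | 2.9% | 17.6% |
| Telecom | 2.8% | 0% | 2.8% | 0% |
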
The takeaways

What the data says.

5 dated, citable findings from the Tranco top 1,000,000 robots.txt scan.

9.33%

fully block at least one major AI crawler (14.8% of the 63.05% that publish a robots.txt). The overwhelming majority of the web leaves the AI crawlers a clear path.
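The relationship between the two percentages is plain division, and the same division can be run backwards on the rank-band figures to see what share of each band publishes a robots.txt at all. A short sketch using only numbers quoted in this study; the function name is illustrative, and the per-band robots.txt rates it derives are implied by the arithmetic rather than reported directly:

```python
blocks_any_pct = 9.33      # % of all scanned sites that fully block >= 1 AI crawler
serves_robots_pct = 63.05  # % of all scanned sites that publish a robots.txt

# Share of robots.txt-publishing sites that block at least one AI crawler.
share_of_robots_serving = round(blocks_any_pct / serves_robots_pct * 100, 1)
print(share_of_robots_serving)  # 14.8


def implied_robots_rate(pct_of_all: float, pct_of_robots_serving: float) -> float:
    """Back out the % of sites in a band that publish a robots.txt."""
    return round(pct_of_all / pct_of_robots_serving * 100, 1)


print(implied_robots_rate(12.9, 22.51))  # top 1,000 band: 57.3
print(implied_robots_rate(9.13, 14.5))   # long-tail band: 63.0
```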

7.4% ≈ 7.23%

GPTBot ≈ CCBot. OpenAI’s crawler is the most-blocked, but Common Crawl’s CCBot — the open corpus most models train on indirectly — is blocked almost exactly as hard.

9.13–13.52%

Concentrated at the head. 13.52% of sites ranked 1,001 to 10,000 block an AI crawler, fading to 9.13% across the long tail. That is the inverse of anti-bot/WAF adoption, which climbs steadily down the ranking.

6.9% vs 1.1%

Train on me ≠ send me users. Training crawlers are blocked ~6.3× more than the AI search / assistant agents that return referral traffic.

7.19%

slam the door on every bot with a blanket User-agent: * / Disallow: / — overwhelmingly the web’s parked, expired and placeholder domains, which shut out AI crawlers as collateral.

Same universe as the Anti-Bot Adoption Index, so you can join “does it run a WAF?” against “does it block AI in robots.txt?”
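Because both studies cover the same domain list, such a join reduces to a cross-tabulation keyed on domain. The sketch below shows the idea with made-up records; the domains, the dictionary layout and the `cross_tab` helper are all hypothetical and do not reflect the schema of the published datasets.

```python
from collections import Counter

# Hypothetical per-domain flags (illustrative data, not real results).
runs_waf = {
    "example-news.com": True,
    "example-shop.com": True,
    "example-blog.com": False,
}
blocks_ai = {
    "example-news.com": True,
    "example-shop.com": False,
    "example-blog.com": False,
}


def cross_tab(waf: dict[str, bool], ai: dict[str, bool]) -> Counter:
    """Count domains per (runs_waf, blocks_ai) cell, for domains present in both."""
    counts = Counter()
    for domain in waf.keys() & ai.keys():
        counts[(waf[domain], ai[domain])] += 1
    return counts


table = cross_tab(runs_waf, blocks_ai)
print(table[(True, True)])   # WAF and AI block
print(table[(True, False)])  # WAF only
```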

Reference tables

The numbers, as tables.

The same figures behind the charts, as plain HTML tables — easy to copy, and machine-readable for search engines and AI answer engines that can't parse a chart.

Per-crawler blocking — Tranco top 1,000,000

| AI crawler | Operator | Purpose | Fully blocks (% of all) | Named (% of all) |
| --- | --- | --- | --- | --- |
| GPTBot (OpenAI) | OpenAI | AI training | 7.4% | 8.91% |
| CCBot (Common Crawl) | Common Crawl | AI training | 7.23% | 12.5% |
| Bytespider (ByteDance) | ByteDance | AI training | 6.77% | 11.81% |
| ClaudeBot (Anthropic) | Anthropic | AI training | 6.69% | 7.95% |
| Amazonbot | Amazon | Search/assistant | 6.6% | 7.31% |
| Google-Extended | Google | AI training | 6.28% | 7.4% |
| Meta-ExternalAgent | Meta | AI training | 6.02% | 6.96% |
| Applebot-Extended | Apple | AI training | 5.93% | 6.54% |
| PetalBot | Huawei (Petal) | Search | 1.61% | 6.47% |
| ChatGPT-User | OpenAI | AI assistant | 1.59% | 2.71% |
| anthropic-ai | Anthropic | AI training | 1.51% | 2.23% |
| PerplexityBot | Perplexity | AI search | 1.29% | 2.49% |
| cohere-ai | Cohere | AI training | 1.18% | 1.64% |
| Claude-Web | Anthropic | AI assistant | 1.18% | 1.66% |
| ImageSiftBot | ImageSift / Hive | AI training | 1.15% | 1.3% |
| Omgilibot | Webz.io | AI training | 1.13% | 1.4% |
| YouBot | You.com | AI search | 0.97% | 1.41% |
| FacebookBot | Meta | Crawler | 0.95% | 1.48% |
| Diffbot | Diffbot | Crawler | 0.92% | 1.15% |
| OAI-SearchBot | OpenAI | AI search | 0.68% | 1.65% |

Blocking by Tranco rank band

| Rank band | Sites | Block ≥1 AI (% of all) | % of robots-serving |
| --- | --- | --- | --- |
| 1-1000 | 1,000 | 12.9% | 22.51% |
| 1001-10000 | 9,000 | 13.52% | 20.9% |
| 10001-100000 | 90,000 | 10.9% | 17.07% |
| 100001-1000000 | 898,497 | 9.13% | 14.5% |

Method

How we measured it.

We fetched /robots.txt for each of the Tranco top 1,000,000 (same domain universe as the Anti-Bot Adoption Index), from a datacenter IP, and parsed every User-agent group.

A site “fully blocks” a crawler when its robots.txt names that user-agent (exact, case-insensitive) in a group with Disallow: /. That’s the strict, unambiguous “kept off the whole site” signal — it deliberately excludes partial paths (Disallow: /private) and allow-only mentions, which is why our numbers are lower than studies that count any mention of a bot.
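The rule described above can be sketched as a small classifier. This is an illustration written from the definition in this section, not the study's actual parser: the function names and the three labels are hypothetical, and real-world robots.txt files have edge cases (byte-order marks, malformed lines, wildcards in agent names) that this sketch does not handle.

```python
def parse_groups(robots_txt: str) -> list[tuple[list[str], list[tuple[str, str]]]]:
    """Split a robots.txt body into (user_agents, rules) groups."""
    groups = []
    agents: list[str] = []
    rules: list[tuple[str, str]] = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def classify(robots_txt: str, crawler: str) -> str:
    """Return 'fully_blocks', 'named' or 'not_named' for one user-agent."""
    name = crawler.lower()  # exact, case-insensitive match
    named = False
    for agents, rules in parse_groups(robots_txt):
        if name not in agents:
            continue
        named = True
        if ("disallow", "/") in rules:  # whole-site disallow only
            return "fully_blocks"
    return "named" if named else "not_named"


sample = """
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /private  # partial path, so not a full block

User-agent: *
Allow: /
"""
print(classify(sample, "gptbot"))         # fully_blocks
print(classify(sample, "ClaudeBot"))      # named
print(classify(sample, "PerplexityBot"))  # not_named
```

Calling `classify(text, "*")` applies the same test to the wildcard group, which is one way to detect a blanket * / Disallow: / file.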

robots.txt is a published request, not an enforced wall — a crawler can ignore it. This measures stated policy, not traffic. We track 20 AI user-agents across OpenAI, Anthropic, Google, Common Crawl, ByteDance, Meta, Amazon, Apple, Perplexity, Cohere and others.
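Compliance is something the crawler has to opt into. As an illustration of what that check looks like on the crawler side, the sketch below uses Python's standard-library `urllib.robotparser`; the policy text, the `ExampleBot` name and the example.com URL are placeholders, and `may_fetch` is a hypothetical helper rather than anything a named crawler is known to run.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the stated policy whether user_agent is permitted to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


policy = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

print(may_fetch(policy, "GPTBot", "https://example.com/article"))      # False
print(may_fetch(policy, "ExampleBot", "https://example.com/article"))  # True
```

Nothing stops a crawler from skipping this call entirely, which is why the figures here describe stated policy rather than observed traffic.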

Keep going

Use the data.

Companion study: how much of the web runs an anti-bot / WAF? →
Companion study: how much of the web is actually dead? →
Download the full per-domain dataset (CC BY 4.0) →
Scrape protected sites at scale, pay on success →