What percentage of websites block AI crawlers?

In an independent 2026 robots.txt scan of the Tranco top 1,000,000, 9.33% of sites fully blocked at least one major AI crawler (a User-agent with Disallow: /). Among the 63.05% that publish a robots.txt at all, 14.8% block one.

Which AI crawler is blocked the most?

GPTBot (OpenAI) is the single most-blocked at 7.4% of all sites, with Common Crawl's CCBot a close second at 7.23% — the open training corpus is blocked nearly as hard as OpenAI's own crawler. Bytespider, ClaudeBot and Google-Extended follow.

Do sites block AI training crawlers more than AI search crawlers?

Yes. Across the Tranco top 1,000,000, AI training crawlers (GPTBot, CCBot, ClaudeBot, Google-Extended, Bytespider) are blocked on about 6.9% of sites on average, while AI search / assistant agents that return traffic (PerplexityBot, ChatGPT-User, OAI-SearchBot) are blocked on only about 1.1%. Sites distinguish "train on me" from "send me users".

Do the biggest sites block AI crawlers more than smaller ones?

Yes. Blocking is concentrated among the most-trafficked sites: 12.9% of the top 1,000 and 13.52% of sites ranked 1,001 to 10,000 block at least one AI crawler, fading to 9.13% across the long tail (ranks 100,001 to 1,000,000). That is the inverse of managed anti-bot/WAF adoption, which climbs as you go down the ranking: the sites most worth scraping are the ones that bothered to add the line.

Data study · updated June 20, 2026

Who actually blocks the AI crawlers?

We fetched the robots.txt of the Tranco top 1,000,000 and recorded which AI crawlers each site disallows. The headline: almost nobody blocks them, and what blocking there is concentrates among the busiest sites and fades into the long tail. GPTBot leads, but Common Crawl’s CCBot is right behind it.

9.33%

of the 998,497 sites fully block at least one major AI crawler in robots.txt — just 14.8% of those that publish one.

7.4%

block GPTBot — the single most-blocked AI crawler

7.19%

block every bot via * / Disallow: /

Independent robots.txt scan of the Tranco top 1,000,000 (June 20, 2026). “Fully blocks” = the crawler’s user-agent named with Disallow: /.
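The 14.8% figure can be reproduced from the two headline shares. A minimal arithmetic sketch, using only the rounded percentages quoted on this page (the variable names are illustrative, not part of the dataset):

```python
# Rounded headline shares quoted in the study, as percentages of all scanned sites.
blocking_share_of_all = 9.33      # fully block at least one major AI crawler
robots_share_of_all = 63.05       # publish a robots.txt at all

# Share of robots.txt-publishing sites that fully block at least one AI crawler.
blocking_share_of_publishers = blocking_share_of_all / robots_share_of_all * 100

print(f"{blocking_share_of_publishers:.1f}%")  # prints "14.8%"
```

Because the inputs are already rounded, the result should be read as approximate.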

The leaderboard

GPTBot leads — but the open corpus is blocked just as hard.

Share of the Tranco top 1,000,000 that fully block each AI crawler (Disallow: /). The training crawlers (red) cluster at the top; the AI-search / assistant agents that send traffic back (blue) are blocked far less.

Crawler	Fully blocks (% of all)	Operator
GPTBot (OpenAI)	7.4%	OpenAI
CCBot (Common Crawl)	7.23%	Common Crawl
Bytespider (ByteDance)	6.77%	ByteDance
ClaudeBot (Anthropic)	6.69%	Anthropic
Amazonbot	6.6%	Amazon
Google-Extended	6.28%	Google
Meta-ExternalAgent	6.02%	Meta
Applebot-Extended	5.93%	Apple
PetalBot	1.61%	Huawei (Petal)
ChatGPT-User	1.59%	OpenAI
anthropic-ai	1.51%	Anthropic
PerplexityBot	1.29%	Perplexity

% of all scanned sites that fully block each crawler. In the chart, red = AI training crawler, blue = AI search / assistant.

GPTBot (7.4%) and CCBot (7.23%) are neck-and-neck at the top. Blocking GPTBot but not CCBot is leaving the front door open: the Common Crawl corpus is what most models train on indirectly.

The training crawlers — GPTBot, CCBot, ClaudeBot, Google-Extended, Bytespider — sit at ~6.9% on average. The search / assistant agents that return clicks — PerplexityBot, ChatGPT-User, OAI-SearchBot — are blocked on only ~1.1%.

“Fully blocks” counts only a Disallow: / under the crawler’s own user-agent. Many more sites name a crawler with a narrower rule; see the table below.

By traffic rank

AI-blocking is concentrated at the top of the web.

Share of each Tranco rank band that fully blocks at least one AI crawler. It peaks among the most-trafficked sites (12.9% of the top 1,000 and 13.52% of ranks 1,001 to 10,000) and fades into the long tail (9.13%), the inverse of managed anti-bot/WAF adoption, which climbs as you go down the ranking. The sites most worth scraping are the ones that bothered to add the line.

Rank band	Block ≥1 AI crawler	% of robots-serving
1-1000	12.9%	22.51%
1001-10000	13.52%	20.9%
10001-100000	10.9%	17.07%
100001-1000000	9.13%	14.5%

% of sites in each Tranco rank band that fully block at least one AI crawler.

Who blocks — by category

The web that lives on its content blocks the most.

Among major sites in each vertical the split is sharp: news, media, reference and community sites — whose archive is the asset — block AI crawlers heavily, while transactional and functional sites (commerce, finance, SaaS, government, search) barely bother. The categories that most want to be cited block the least.

Category	Block ≥1 AI crawler
News & media	80.6%
Entertainment	69.4%
Reference & wiki	44.4%
Sports	41.7%
Social media	33.3%
Healthcare	33.3%
Forums & community	33.3%
Food & delivery	30.6%
Classifieds	27.8%
Gaming	27.8%
Music & audio	25.7%
Education	25%
Other	25%
Marketplaces	22.2%
Video & streaming	19.4%
Real estate	19.4%
Jobs & recruiting	13.9%
Automotive	13.9%
E-commerce	11.1%
Crypto	11.1%
SaaS & business	11.1%
AI tools	11.1%
Travel & hospitality	8.3%
Finance & markets	8.3%
Government	8.3%
Developer & tech	5.6%
Search engines	2.9%
Telecom	2.8%

% of major sites in each vertical (~36 each) that fully block ≥1 AI crawler. In the chart, red = content / media / community; blue = transactional / functional.

News & media leads at 80.6%, and blocks CCBot (66.7%) harder than GPTBot (47.2%): publishers defend the Common Crawl training corpus first.

Social platforms score high too (33.3%), but mostly via a blanket * / Disallow: / wall (30.6%): they are login-gated to protect user data, and AI crawlers are caught as collateral. News sites write AI-specific rules instead.

Search engines (2.9%), government and developer sites block the least — their whole point is to be indexed and cited.

Rates are for the ~36 major sites in each vertical, so they run higher than the 9.33% whole-1M average. The ranking across categories is the robust signal.

Blocking by category — full table

Category	Block ≥1 AI (%)	GPTBot %	CCBot %	Blocks all (*) %
News & media	80.6%	47.2%	66.7%	8.3%
Entertainment	69.4%	50%	61.1%	2.8%
Reference & wiki	44.4%	27.8%	27.8%	5.6%
Sports	41.7%	30.6%	30.6%	2.8%
Social media	33.3%	30.6%	13.9%	30.6%
Healthcare	33.3%	16.7%	30.6%	0%
Forums & community	33.3%	30.6%	22.2%	8.3%
Food & delivery	30.6%	11.1%	25%	5.6%
Classifieds	27.8%	13.9%	27.8%	8.3%
Gaming	27.8%	13.9%	16.7%	0%
Music & audio	25.7%	22.9%	25.7%	5.7%
Education	25%	19.4%	13.9%	2.8%
Other	25%	13.9%	19.4%	2.8%
Marketplaces	22.2%	19.4%	13.9%	2.8%
Video & streaming	19.4%	13.9%	11.1%	8.3%
Real estate	19.4%	8.3%	2.8%	0%
Jobs & recruiting	13.9%	8.3%	5.6%	5.6%
Automotive	13.9%	8.3%	8.3%	0%
E-commerce	11.1%	11.1%	5.6%	0%
Crypto	11.1%	5.6%	5.6%	0%
SaaS & business	11.1%	5.6%	8.3%	0%
AI tools	11.1%	2.8%	2.8%	5.6%
Travel & hospitality	8.3%	2.8%	2.8%	0%
Finance & markets	8.3%	0%	2.8%	2.8%
Government	8.3%	2.8%	2.8%	0%
Developer & tech	5.6%	2.8%	5.6%	0%
Search engines	2.9%	2.9%	2.9%	17.6%
Telecom	2.8%	0%	2.8%	0%

The takeaways

What the data says.

Five dated, citable findings from the Tranco top 1,000,000 robots.txt scan.

9.33%

fully block at least one major AI crawler (14.8% of the 63.05% that publish a robots.txt). The overwhelming majority of the web leaves the AI crawlers a clear path.

7.4% ≈ 7.23%

GPTBot ≈ CCBot. OpenAI’s crawler is the most-blocked, but Common Crawl’s CCBot — the open corpus most models train on indirectly — is blocked almost exactly as hard.

9.13–13.52%

Concentrated at the head. 13.52% of sites ranked 1,001 to 10,000 (and 12.9% of the top 1,000) block an AI crawler, fading to 9.13% across the long tail. That is the inverse of anti-bot/WAF adoption, which climbs steadily down the ranking.

6.9% vs 1.1%

Train on me ≠ send me users. Training crawlers are blocked ~6.3× more than the AI search / assistant agents that return referral traffic.

7.19%

slam the door on every bot with a blanket User-agent: * / Disallow: / — overwhelmingly the web’s parked, expired and placeholder domains, which shut out AI crawlers as collateral.

Same universe as the Anti-Bot Adoption Index, so you can join “does it run a WAF?” against “does it block AI in robots.txt?”
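Such a join could look roughly like the sketch below. It assumes each study can be reduced to a per-domain boolean; the domains, flag values and the `cross_tab` helper are invented placeholders for illustration and do not reflect the real dataset's schema or contents.

```python
from collections import Counter

# Hypothetical per-domain flags; the real datasets' field names and values may differ.
runs_waf = {"example.com": True, "example.org": False, "example.net": True}
blocks_ai = {"example.com": True, "example.org": False, "example.net": False}


def cross_tab(waf_flags, ai_flags):
    """Count domains by (runs a WAF, fully blocks an AI crawler).

    Only domains present in both inputs are counted, i.e. an inner join on domain.
    """
    counts = Counter()
    for domain in waf_flags.keys() & ai_flags.keys():
        counts[(waf_flags[domain], ai_flags[domain])] += 1
    return counts


table = cross_tab(runs_waf, blocks_ai)
for (waf, ai), n in sorted(table.items()):
    print(f"WAF={waf!s:<5} blocks AI={ai!s:<5} domains={n}")
```

Keying both sides on the domain name is what a shared Tranco universe makes possible; with differing domain lists the inner join would silently drop sites.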

Reference tables

The numbers, as tables.

The same figures behind the charts, as plain HTML tables — easy to copy, and machine-readable for search engines and AI answer engines that can't parse a chart.

Per-crawler blocking — Tranco top 1,000,000

AI crawler	Operator	Purpose	Fully blocks (% of all)	Named (% of all)
GPTBot (OpenAI)	OpenAI	AI training	7.4%	8.91%
CCBot (Common Crawl)	Common Crawl	AI training	7.23%	12.5%
Bytespider (ByteDance)	ByteDance	AI training	6.77%	11.81%
ClaudeBot (Anthropic)	Anthropic	AI training	6.69%	7.95%
Amazonbot	Amazon	Search/assistant	6.6%	7.31%
Google-Extended	Google	AI training	6.28%	7.4%
Meta-ExternalAgent	Meta	AI training	6.02%	6.96%
Applebot-Extended	Apple	AI training	5.93%	6.54%
PetalBot	Huawei (Petal)	Search	1.61%	6.47%
ChatGPT-User	OpenAI	AI assistant	1.59%	2.71%
anthropic-ai	Anthropic	AI training	1.51%	2.23%
PerplexityBot	Perplexity	AI search	1.29%	2.49%
cohere-ai	Cohere	AI training	1.18%	1.64%
Claude-Web	Anthropic	AI assistant	1.18%	1.66%
ImageSiftBot	ImageSift / Hive	AI training	1.15%	1.3%
Omgilibot	Webz.io	AI training	1.13%	1.4%
YouBot	You.com	AI search	0.97%	1.41%
FacebookBot	Meta	Crawler	0.95%	1.48%
Diffbot	Diffbot	Crawler	0.92%	1.15%
OAI-SearchBot	OpenAI	AI search	0.68%	1.65%
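The gap between the "Named" and "Fully blocks" columns shows how often a crawler is mentioned with something narrower than Disallow: /. A small sketch over a few rows copied from the table above (the helper name `full_block_ratio` is illustrative):

```python
# (fully blocks %, named %) for a few crawlers, copied from the table above.
CRAWLERS = {
    "GPTBot": (7.4, 8.91),
    "CCBot": (7.23, 12.5),
    "Bytespider": (6.77, 11.81),
    "ClaudeBot": (6.69, 7.95),
    "PetalBot": (1.61, 6.47),
}


def full_block_ratio(fully_blocks_pct, named_pct):
    """Fraction of sites naming a crawler that also block it site-wide."""
    return fully_blocks_pct / named_pct


for name, (full, named) in CRAWLERS.items():
    gap = named - full  # percentage points named without a full block
    print(f"{name:<12} gap={gap:5.2f} pts  full-block ratio={full_block_ratio(full, named):.0%}")
```

On these rounded figures, most sites that name GPTBot block it outright, while CCBot and PetalBot are far more often named with a partial or allow rule.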

Blocking by Tranco rank band

Rank band	Sites	Block ≥1 AI (% of all)	% of robots-serving
1-1000	1,000	12.9%	22.51%
1001-10000	9,000	13.52%	20.9%
10001-100000	90,000	10.9%	17.07%
100001-1000000	898,497	9.13%	14.5%
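As a consistency check, the per-band rates should average back to the 9.33% headline when weighted by the number of sites in each band. A minimal sketch using the rounded values from the table above:

```python
# (sites in band, % of band that fully blocks at least one AI crawler), from the table above.
BANDS = [
    (1_000, 12.9),
    (9_000, 13.52),
    (90_000, 10.9),
    (898_497, 9.13),
]

total_sites = sum(sites for sites, _ in BANDS)
blocking_sites = sum(sites * pct / 100 for sites, pct in BANDS)

# Site-weighted average across all bands, as a percentage.
overall_pct = blocking_sites / total_sites * 100

print(total_sites)            # prints 998497
print(f"{overall_pct:.2f}%")  # prints "9.33%"
```

The long-tail band holds about 90% of the sites, which is why the overall rate sits so close to its 9.13%.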

Method

How we measured it.

We fetched /robots.txt for each of the Tranco top 1,000,000 (same domain universe as the Anti-Bot Adoption Index), from a datacenter IP, and parsed every User-agent group.
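A fetch step along these lines could be sketched with the Python standard library. The study does not state its scheme, timeout, redirect handling or request headers, so the HTTPS URL, the 10-second timeout and the `robots-survey-sketch/0.1` User-Agent string below are all assumptions, and `robots_url` and `fetch_robots` are hypothetical helper names.

```python
import http.client
import urllib.request


def robots_url(domain):
    """Build the robots.txt URL for a bare domain (HTTPS is an assumption)."""
    return f"https://{domain.strip().lower()}/robots.txt"


def fetch_robots(domain, timeout=10.0):
    """Return the robots.txt body as text, or None if it cannot be retrieved."""
    request = urllib.request.Request(
        robots_url(domain),
        headers={"User-Agent": "robots-survey-sketch/0.1"},  # placeholder identity
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, DNS and connection failures, timeouts and malformed URLs.
        return None
```

Returning None for every failure keeps the sketch short; a real survey would likely want to distinguish "no robots.txt" (for example a 404) from "site unreachable".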

A site “fully blocks” a crawler when its robots.txt names that user-agent (exact, case-insensitive) in a group with Disallow: /. That’s the strict, unambiguous “kept off the whole site” signal — it deliberately excludes partial paths (Disallow: /private) and allow-only mentions, which is why our numbers are lower than studies that count any mention of a bot.
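The rule above can be sketched as a small parser. This is an illustration of the stated definition, not the study's actual code: it groups consecutive User-agent lines, matches names exactly but case-insensitively, and counts only a literal Disallow: / rule. The function names and the sample file are hypothetical.

```python
SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /private   # partial path, not a full block

User-agent: *
Allow: /
"""


def fully_blocked_agents(robots_txt):
    """Return the lower-cased user-agents whose group contains `Disallow: /`."""
    blocked = set()
    current_agents = []
    seen_rule = False  # True once the current group has an Allow/Disallow line
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                current_agents = []
                seen_rule = False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            seen_rule = True
            if field == "disallow" and value == "/":
                blocked.update(current_agents)
    return blocked


def fully_blocks(robots_txt, crawler):
    """True if `crawler` is named (exact, case-insensitive) with Disallow: /."""
    return crawler.lower() in fully_blocked_agents(robots_txt)


print(sorted(fully_blocked_agents(SAMPLE)))  # prints ['ccbot', 'gptbot']
```

In the sample, ClaudeBot is named but only with a partial path, so it does not count as fully blocked.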

robots.txt is a published request, not an enforced wall — a crawler can ignore it. This measures stated policy, not traffic. We track 20 AI user-agents across OpenAI, Anthropic, Google, Common Crawl, ByteDance, Meta, Amazon, Apple, Perplexity, Cohere and others.
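Because the file is only a request, honoring it is up to the crawler. A cooperative crawler might consult it before each fetch, for instance with Python's standard `urllib.robotparser`; the `may_fetch` wrapper, the sample file and the example URL below are illustrative.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def may_fetch(robots_txt, user_agent, url):
    """Ask the standard-library parser whether `user_agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/article"))         # prints False
print(may_fetch(ROBOTS_TXT, "PerplexityBot", "https://example.com/article"))  # prints True
```

Nothing in the protocol stops a crawler from skipping this check, which is why the study describes its numbers as stated policy rather than observed traffic.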

Keep going

Use the data.

Companion study: how much of the web runs an anti-bot / WAF? →
Companion study: how much of the web is actually dead? →
Download the full per-domain dataset (CC BY 4.0) →
Scrape protected sites at scale, pay on success →
