Crawlora
Structured public web data APIs for search, maps, geocoding, streaming, travel, real estate, marketplaces, apps, social, audio, crypto, finance, and AI workflows with managed execution and credit-based usage.

© 2026 Crawlora. All rights reserved.·Built by Tony Wang
By Tony Wang · June 14, 2026 · 22 min read

14% of the Web Is Actually Dead

Only 14% of the top 10 million domains are genuinely dead — not the usual 27.6%. Most 'dead' sites are just blocking bots or serving errors.

Data Study · Anti-Bot · Web Scraping API

Key takeaways

  • We probed 9,992,781 of the top 10 million domains in June 2026. 14.2% are genuinely dead — no DNS, no connection, nothing answers — not the 27.6% a naive crawl of the same list reports.
  • Most of 'dead' was never dead. 8.9% of the top web (891,672 sites) answers but blocks an automated client (403/429/anti-bot), and another ~4% serves a 404 or 5xx from a live server. Naive crawls count all of that as death.
  • The genuinely dead web is mostly DNS that no longer resolves: 1,077,715 domains — 76% of all dead — have left DNS entirely. The rest refuse or reset the connection. A 404 page is not death; a missing DNS record is.
  • Death is uneven by TLD. .cn (33%), .info (28%), .in and .gov (26%), and .edu (22%) rot fastest — institutional and cheap-registration domains lead, echoing Pew's finding that government and reference pages suffer the worst link rot. .com sits near the 14% line.
  • This is not 'link rot' or 'dead internet theory.' We measure whether the domain itself still resolves and answers — a different question from broken links inside pages (Pew, Ahrefs) or AI-generated content flooding the web.

You have probably seen the stat: 27.6% of the web is dead. It comes from a 2024 crawl of the top 10 million domains, and it gets repeated because it is striking and a little bleak. We ran that study. And when we re-scanned the same 10-million-domain list in 2026 — this time separating a domain that is genuinely gone from one that is merely refusing a bot — the real number came out at 14.2%.

The web didn't suddenly heal. The original number was counting the wrong things. A naive crawler can't tell a dead domain from a live one hiding behind Cloudflare, and it counts a server that politely returns "404 Not Found" the same as one that never answers at all. Fix the classification and roughly half of "dead" turns out to be alive — it just wasn't talking to a bot. Here is the full picture, from 9,992,781 probed domains.

What the outcome of a 10-million-domain scan actually looks like

Every domain gets one of four labels. alive means it answered (a 2xx, or even a 404/5xx — the server is up). blocked means it answered but refused our automated client (a 403, 429, or anti-bot challenge). redirect means it bounced somewhere we couldn't resolve. dead means it never answered at all — no DNS record, or nothing accepts a connection.
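The four labels can be written down as a small decision function. This is a minimal sketch under stated assumptions, not the scanner's actual code: the names `Outcome` and `classify` are hypothetical, the `challenge` flag stands in for whatever challenge-page detection a real probe would do, and treating every other answered status as alive is an inference from the "the server is up" rule.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    ALIVE = "alive"
    BLOCKED = "blocked"
    REDIRECT = "redirect"
    DEAD = "dead"


def classify(status: Optional[int], challenge: bool = False) -> Outcome:
    """Map the final response of a probe to one of the four labels.

    status is the final HTTP status code, or None if nothing answered
    (no DNS record, or no connection accepted). challenge marks a
    response whose body is an anti-bot challenge page (assumed to be
    detected elsewhere).
    """
    if status is None:
        return Outcome.DEAD        # never answered at all
    if challenge or status in (403, 429):
        return Outcome.BLOCKED     # answered, but refused the client
    if 300 <= status < 400:
        return Outcome.REDIRECT    # still a 3xx after following: unresolved
    return Outcome.ALIVE           # 2xx, 404, 5xx: the server is up
```

The order of the checks matters: "no answer" is tested first, so an error status can never be mistaken for death.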

  • Alive: 76.6% · 7,655,028
  • Blocked: 8.9% · 891,672
  • Dead: 14.2% · 1,414,788
  • Redirect: 0.3% · 31,293

9,992,781 of the top 10 million domains, probed as a polite bot from a datacenter IP, June 2026.

Three-quarters of the top web is alive and answering. The interesting part is the bottom 23% — the slice everyone argues about — and how you split it.

The real number: 14% dead, not 27.6%

Same list, same scale, one difference: in 2026 we refuse to call a domain dead just because our bot couldn't read it. A genuinely dead domain fails early — DNS returns nothing, or the connection is refused. A live-but-defended domain fails late, with a 403 or a challenge page, which is a completely different signal. Counting honestly moves the headline from 27.6% to 14.2%.

  • Naive 2024 crawl, counted as dead: 27.6%. DNS failure, anti-bot 403s, served 404/5xx and timeouts all lumped together.
  • Honest 2026 classification, actually dead: 14.2%. No DNS, connection refused, or nothing accepts a connection.

The same top-10-million list, classified two ways. The 13-point gap is anti-bot blocking and answered errors counted as death.

Where do the missing ~13 points go? Almost all of it is two things a naive crawl mislabels:

  • 8.9% (891,672 sites) answer but block bots. A 403, a 429, or a Cloudflare "Just a moment" challenge to a datacenter IP. These are some of the most alive sites on the web — they run active defenses precisely because people want their data.
  • ~4% serve a 404 or 5xx from a live server. A "404 Not Found" or a "503 Service Unavailable" is proof the host answered. The original crawl counted them as dead; a server that returns an error is the opposite of gone.

The remainder is a 2024 measurement artifact: that crawl resolved each domain through a single DNS resolver, and a flaky lookup falsely marked resolvable domains dead. We now cross-check across resolvers before declaring a DNS failure.
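The cross-resolver check can be sketched as a rule over several lookups. This is an illustration only: `dns_is_dead` and the `Resolver` type are hypothetical names, and each resolver is assumed to be a callable you supply (for example, a wrapper around a query to one specific public DNS server) that returns a list of addresses or raises `OSError` on failure.

```python
from typing import Callable, Iterable, List

# A resolver takes a domain and returns its IP addresses (empty if none).
Resolver = Callable[[str], List[str]]


def dns_is_dead(domain: str, resolvers: Iterable[Resolver]) -> bool:
    """Return True only if every resolver fails to find the domain."""
    asked = 0
    for resolve in resolvers:
        asked += 1
        try:
            if resolve(domain):
                return False  # one successful lookup rules out a DNS death
        except OSError:
            continue  # a flaky or failing lookup is not evidence on its own
    return asked > 0  # with no resolvers asked there is no verdict at all
```

Because `socket.gaierror` is a subclass of `OSError`, a resolver built on the standard library's `socket.getaddrinfo` fits this shape without extra handling.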

Dead means unreachable, not 'returned an error.'

The whole correction rests on one rule: a server that answers anything (even a 404, a 500, or a 403) is up, so it isn't dead. Only a domain that no DNS resolver can find, or that refuses or resets every connection, is dead. Most "dead web" counts skip this rule and nearly double the number.

What a no-follow crawler gets wrong

The gap between 27.6% and 14.2% is largely a measurement choice: whether you follow redirects and read what the server actually says. A crawler that stops at the first response sees only 45.9% return a clean 200 and writes off the rest. Follow the redirects and read the bodies, and 71.9% are alive. Here is where every first response actually ends up:

First response: 200 OK 4.6M · 3xx redirect 3.1M · No response 1.5M · 403 / 429 411K · 404 237K · 5xx 86K
Final outcome: Alive 7.6M · Blocked 881K · Dead 1.4M · Redirect 31K

Where each first response actually ends up (top 10M, 2026). A no-follow crawler counts the whole 3xx band as 'not 200', but most of it resolves to a live page.

  • 200 OK → Alive: 4,584,611 (46.3%)
  • 3xx redirect → Alive: 2,677,304 (27%)
  • No response → Dead: 1,413,013 (14.3%)
  • 403 / 429 → Blocked: 410,511 (4.1%)
  • 3xx redirect → Blocked: 365,368 (3.7%)
  • 404 → Alive: 236,685 (2.4%)
  • No response → Blocked: 105,222 (1.1%)
  • 5xx → Alive: 85,728 (0.9%)
  • 3xx redirect → Redirect: 31,267 (0.3%)
  • 3xx redirect → Dead: 1,775 (0%)

The big rivers carry the point: a 301 is not a dead end — 87% of redirects resolve to a live page, and a 403 or 429 is a live site refusing a bot, not a corpse. The only response that reliably means dead is no response at all — and that single No response → Dead band is almost the entire dead web.
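The 87% figure can be re-derived from the flow counts listed above. The snippet below is just that arithmetic; the dictionary layout and the `share` helper are illustrative names, while the counts are copied from the flows.

```python
# (first response, final outcome) -> domain count, copied from the flows
flows = {
    ("200 OK", "alive"): 4_584_611,
    ("3xx redirect", "alive"): 2_677_304,
    ("No response", "dead"): 1_413_013,
    ("403 / 429", "blocked"): 410_511,
    ("3xx redirect", "blocked"): 365_368,
    ("404", "alive"): 236_685,
    ("No response", "blocked"): 105_222,
    ("5xx", "alive"): 85_728,
    ("3xx redirect", "redirect"): 31_267,
    ("3xx redirect", "dead"): 1_775,
}


def share(first: str, outcome: str) -> float:
    """Fraction of domains with this first response that end at outcome."""
    total = sum(n for (f, _), n in flows.items() if f == first)
    return flows.get((first, outcome), 0) / total


# Fraction of redirects that end on a live page
redirect_live = share("3xx redirect", "alive")

# Fraction of all dead domains that never answered the first request
dead_total = sum(n for (_, o), n in flows.items() if o == "dead")
dead_from_silence = flows[("No response", "dead")] / dead_total

print(f"redirects ending alive: {redirect_live:.1%}")
print(f"dead domains that never answered: {dead_from_silence:.1%}")
```

Run as written, this prints 87.0% and 99.9%: the dead total of 1,414,788 comes almost entirely from the No response band.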

The genuinely dead web is mostly DNS that's gone

So what is the 14.2%? Overwhelmingly, it's domains that have left DNS entirely. Of the 1,414,788 genuinely dead domains, 1,077,715 — about 76% — no longer resolve to any IP at all. The registration lapsed, the zone was deleted, the project was abandoned. The rest refuse or reset every connection, or fail TLS to a host that is truly down. A dead domain almost never answers and errors — it simply isn't there.
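In a Python prober, these failure modes surface as distinct exception types, so they can be told apart without parsing messages. A minimal sketch, assuming the probe uses the standard `socket` and `ssl` modules; `failure_kind` and its return labels are hypothetical, and counting a timeout as "nothing accepts a connection" is this sketch's assumption.

```python
import socket
import ssl


def failure_kind(exc: BaseException) -> str:
    """Name the way a connection attempt failed, from the exception raised."""
    if isinstance(exc, socket.gaierror):
        return "dns"       # the name did not resolve
    if isinstance(exc, ConnectionRefusedError):
        return "refused"   # the host actively rejected the connection
    if isinstance(exc, ConnectionResetError):
        return "reset"     # the connection was dropped mid-handshake
    if isinstance(exc, ssl.SSLError):
        return "tls"       # the TLS handshake failed
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"   # nothing accepted the connection in time
    return "other"
```

All of these are subclasses of `OSError`, so a bare `except OSError` would lump them together; checking the specific types keeps "the domain is gone" separate from "the host is down".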

This matters if you build anything that follows links or crawls a list: the failures you'll actually hit are split between "this domain is gone" (retry never helps) and "this site is blocking me" (a different request gets in). Treating them the same is the single most common way web-health numbers get inflated — and the most common way a scraper wastes a budget retrying domains that will never answer.

The famous dead

Aggregate percentages are abstract. So we sorted the genuinely-dead domains by popularity rank and went looking for names you'd recognise — and the graveyard is remarkable. The single highest-ranked dead domain in the entire top 10 million makes the point on its own.

At #568 sits fanlink.to, the music "smart-link" service artists and labels used for pre-save and streaming links. In March 2024 its parent — Eventbrite's ToneDen — lost control of the .to domain and never recovered it, instantly breaking millions of links sitting in artist bios, ads, and press releases.

Which raises the obvious question: how is a dead domain the 568th most popular on the web? Because the web never stopped knocking. Every un-updated link, embed, and bookmark keeps firing requests at an address that no longer answers — the rank is a fossil of past popularity. That is precisely why a popularity-ranked list is full of corpses at all.

Music & video

  • fanlink.to † 2024

    Music smart-links · ToneDen / Eventbrite

    The single highest-ranked dead domain in the whole top 10M (#568). In March 2024 Eventbrite lost control of the .to domain overnight, instantly breaking millions of artists' pre-save and streaming links sitting in bios, ads, and press releases.

  • grooveshark.com † 2015

    Free music streaming · ~20M users

    Forced shut by the major labels' copyright suit (willful infringement, ~$700M of exposure). The entire catalogue was wiped the day the settlement landed; a co-founder died months later at 28.

  • rdio.com † 2015

    Music subscription service

    Bankrupt after burning ~$2M a month. Pandora bought the technology for $75M and shut the service down the day before the sale closed.

  • gfycat.com † 2023

    GIF host for Reddit & Discord · ~220M users

    Bought by Snap in 2020, then switched off as a non-core asset. It was one of the largest single link-rot events ever, breaking millions of embedded GIFs across the web.

  • veoh.com † 2024

    Video-sharing site

    Won a landmark DMCA case that helped protect every YouTube-style site, limped on for years under Japan's FC2, and finally went dark in November 2024.

  • metacafe.com † 2021

    Top-3 video site of 2006

    One of YouTube's first serious rivals. It simply went offline one day in 2021 with no announcement at all.

The social web

  • del.icio.us † 2017

    Delicious · invented social bookmarking

    The site that coined web-scale tagging. Passed through five owners (Yahoo → AVOS → Science → Delicious Media → Pinboard for $35,000) before going read-only.

  • dmoz.org † 2017

    The Open Directory · a human-curated map of the web

    91,000 volunteers cataloguing 3.8M sites: once a near-prerequisite for SEO, then made obsolete by Google's algorithm. Lives on as the community fork Curlie.

  • pipes.yahoo.com † 2015

    Yahoo Pipes · visual no-code data mashups

    The “Zapier of 2007.” Killed in a Yahoo cost-cut; thousands of live RSS and data pipelines broke on the same day.

  • topsy.com † 2015

    The only full historical Twitter search

    Indexed hundreds of billions of tweets back to 2006. Apple bought it for ~$200M and quietly switched it off two years later; the searchable archive simply vanished.

  • aviary.com † 2018

    Photo-editing SDK embedded in 7,000+ apps

    Powered in-app photo editing across the mobile economy (10B edits). Adobe acquired it, folded the tech into Creative Cloud, then sunset the free SDK.

The developer web

  • s7.addthis.com † 2023

    Share buttons + tracking on 15M websites

    Oracle bought it for the behavioural data, then killed it under GDPR pressure. A single shutdown darkened share widgets across millions of sites at once.

  • programmableweb.com † 2023

    The public directory of ~19,000 web APIs

    The index of the “API economy” for 17 years. Salesforce / MuleSoft erased the whole thing with no archive.

  • securityfocus.com † 2021

    Home of the Bugtraq disclosure list (since 1993)

    The security world's noticeboard for nearly 30 years. Symantec → Broadcom → Accenture let it freeze; the Bugtraq archive survives only at seclists.org.

  • opensolaris.org † 2013

    Sun's open-source operating system

    Oracle froze it the moment it bought Sun and pulled the domain in 2013. The community kept the code alive as the illumos fork.

  • sorbs.net † 2024

    Spam blocklist covering 512M IP addresses

    A DNS blocklist that mail servers queried for over two decades. Proofpoint pulled the plug in 2024; servers worldwide still query a list that no longer answers.

Government & institutions

  • patft.uspto.gov † 2022

    US patent full-text search (1790–present)

    Retired for a new search tool, breaking decades of direct patent links embedded in academic papers, legal briefs, and analysis tools.

  • petitions.whitehouse.gov † 2021

    Obama's “We the People” e-petitions

    A petition once topped a million signatures. The platform was quietly discontinued on Inauguration Day 2021 and never revived.

  • weblogs.com † ~2009

    Dave Winer's blog-ping server · the early blogosphere's heartbeat

    Every new blog post once pinged this host; VeriSign paid $2.3M for it. It faded after 2009, yet old WordPress installs still ping the dead address to this day.

  • europa.eu.int † 2006

    The European Union's original web address

    The canonical home of EU law and institutions for over a decade. Migrated to europa.eu on Europe Day 2006, stranding a generation of links.

Read those twenty obituaries back-to-back and one cause of death stands out: being acquired. Seven of the twenty were bought by a bigger company that then switched them off — Snap killed Gfycat, Apple killed Topsy, Oracle killed both AddThis and OpenSolaris, Adobe killed Aviary, Salesforce killed ProgrammableWeb, Broadcom let SecurityFocus rot. "Acqui-killed" beats bankruptcy, lawsuits, and neglect combined.

  • Acquired, then killed: 7
  • Strategic shutdown / cost-cut: 5
  • Neglect / abandoned domain: 4
  • Bankruptcy or lawsuit: 2
  • Migrated / retired elsewhere: 2

How twenty of the web's most famous dead domains actually died. Acquisition is the leading cause, ahead of bankruptcy, lawsuits, and neglect together.

Twenty headliners can't show the shape of the whole graveyard. So we widened the lens — pulling ~100 widely-recognised, verifiable shutdowns (from this scan's dead domains and the public record), dating each to the year its service ended and sorting them into six corners of the web. Stacked by year, two decades of the dying web look like this:

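Categories: Social & community · Developer & infrastructure · Music & video · Search & reference · Media & news · Commerce & government

| Topic | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 | 2026 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Social & community | 0 | 0 | 0 | 1 | 1 | 2 | 4 | 2 | 1 | 2 | 1 | 2 | 4 | 1 | 2 | 1 | 1 | 3 | 1 | 1 | 0 |
| Developer & infrastructure | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 1 | 3 | 2 | 4 | 4 | 2 | 1 | 1 | 2 | 0 | 4 | 1 | 1 | 0 |
| Music & video | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 3 | 2 | 1 | 0 | 3 | 2 | 0 | 2 | 2 | 0 | 0 |
| Search & reference | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| Media & news | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| Commerce & government | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |

When ~100 notable websites died, by category and year. A curated set of widely-recognised shutdowns (drawn from this scan's dead domains plus public records, dated to the year each service ended), so it's illustrative of the eras, not an exhaustive census.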

Two things stand out. Social platforms and developer tools are the bulk of the dead web — the social graveyard (Friendster, Orkut, Bebo, Google+, Path, Yik Yak, Ello, Digg…) and the dev-tools column (Google Code, Parse, Google Wave, Gitorious, Sunrise, Mailbox…) are dead even, and together they're more than half of everything here. And the deaths cluster: a first swell in 2012–2017 as the Web 2.0 and check-in/anonymous-app generation collapsed, then a second from 2020 as pandemic-era and big-tech bets were cut (Quibi, Mixer, CNN+, Stadia, Google Play Music). Before 2009 the stream barely exists — most of the web simply wasn't old enough to have died yet.

One honest caveat, and the reason we re-checked every domain by hand: a dead domain is not always a dead thing. Some only look dead because the service rebranded or moved — money.yandex.ru became YooMoney, the old suicidepreventionlifeline.org host gave way to 988lifeline.org, the EU's europa.eu.int simply became europa.eu. We re-probed every domain above against live DNS in June 2026 and dropped the false positives (nrel.gov and angelfire.com still resolve fine). What remains genuinely no longer answers.

Death is uneven: which TLDs rot fastest

Dead rate is not evenly spread. Split the 10 million by top-level domain and a clear gradient appears — cheap-registration and institutional TLDs rot far faster than the .com baseline.

  • .cn: 33.0% dead · 42,827 domains
  • .info: 28.4% dead · 61,440 domains
  • .in: 25.9% dead · 63,614 domains
  • .gov: 25.9% dead · 43,435 domains
  • .edu: 22.0% dead · 128,968 domains
  • .us: 22.0% dead · 41,645 domains
  • .br: 20.9% dead · 99,202 domains
  • .net: 19.9% dead · 347,414 domains

Dead rate among common TLDs (≥20k domains in the top 10M).

The standouts tell two stories. .cn, .info, and .in lead because they are cheap and heavily registered for short-lived or speculative sites that lapse quickly. But .gov (26%) and .edu (22%) near the top is the more striking finding: institutional domains rot badly because content is reorganized, departments are dissolved, and old project sites are simply switched off — exactly the digital decay Pew Research documented in 2024, where government and reference pages had some of the worst link rot. The web's most authoritative corners are some of its least permanent.

The geography of the dead web

Group the country-code domains by country and the decay draws a map. The emerging-market registration booms of the last decade left the biggest graveyards — China's .cn leads at a third dead — while German-speaking Europe runs the most durable web on earth.

Map scale: 7% to 33% dead, with a separate "no data" shade.
Dead-domain rate by country-code TLD, 2026. Redder is deader.

  • China (.cn): 33%
  • India (.in): 25.9%
  • United States (.us): 22%
  • Brazil (.br): 20.9%
  • Spain (.es): 16.6%
  • Japan (.jp): 15.6%
  • United Kingdom (.uk): 15.3%
  • Russia (.ru): 14.9%
  • France (.fr): 14.5%
  • Canada (.ca): 14.1%
  • Italy (.it): 13.5%
  • Poland (.pl): 13.2%
  • Sweden (.se): 11.6%
  • Switzerland (.ch): 9.8%
  • Netherlands (.nl): 9.7%
  • Austria (.at): 8.6%
  • Germany (.de): 7.6%
  • Czechia (.cz): 7.3%
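
  • China (.cn): 33.0% · 42,827 domains
  • India (.in): 25.9% · 63,614 domains
  • United States (.us): 22.0% · 41,645 domains
  • Brazil (.br): 20.9% · 99,202 domains
  • Spain (.es): 16.6% · 67,984 domains
  • Japan (.jp): 15.6% · 253,187 domains
  • UK (.uk): 15.3% · 244,776 domains
  • Russia (.ru): 14.9% · 301,639 domains
  • France (.fr): 14.5% · 135,021 domains
  • Italy (.it): 13.5% · 107,638 domains
  • Netherlands (.nl): 9.7% · 86,383 domains
  • Germany (.de): 7.6% · 348,251 domains
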
Dead rate among major country-code TLDs (≥40,000 domains each). China's .cn leads; Germany's .de is the most durable.

A domain in China's .cn space is more than four times as likely to be dead as one in Germany's .de. Fast, cheap, speculative registration — and, for .cn, a churn-heavy market behind the Great Firewall — leaves more abandoned domains behind; the mature, costlier-to-register German-speaking TLDs barely rot at all.

What the top 10 million is even made of

For context, here's the shape of the corpus itself. .com is not just first — it is nearly half of the entire top 10 million, larger than every country-code and new-gTLD combined.

  • .com: 44.1% · 4,403,688
  • .org: 8.8% · 878,764
  • .io: 3.6% · 363,234
  • .de: 3.5% · 348,251
  • .net: 3.5% · 347,414
  • .ru: 3.0% · 301,639
  • .jp: 2.5% · 253,187
  • .uk: 2.5% · 244,776
  • .fr: 1.4% · 135,021
  • .edu: 1.3% · 128,968

The 10 largest TLDs in the top 10 million by domain count. .com alone is 44%.

Two details worth flagging: .io (3.6%) has quietly become the third-largest TLD on the popular web — the developer/startup default — and the AI-era .ai (0.30%, ~30,000 domains) has already overtaken established country domains like .fi, .no, and .tw in the top 10 million.

The dead web is the long tail nobody visits

Death is not spread evenly through the ranking. Split the 10 million by popularity and the dead rate climbs more than 20× — from 0.8% in the top 1,000 to 16.1% past rank 5 million. blocked runs the other way: the most-trafficked sites wall bots hardest, then the defenses thin out down the tail.

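Dead: 0.8% → 16.1% · Blocked: 12.9% → 8.5%

  • Top 1K: 0.8% dead · 12.9% blocked
  • 1K–10K: 1.2% dead · 15.1% blocked
  • 10K–100K: 2.3% dead · 10.8% blocked
  • 100K–1M: 8.7% dead · 9.3% blocked
  • 1M–5M: 13.3% dead · 9.3% blocked
  • 5M–10M: 16.1% dead · 8.5% blocked
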
Dead and blocked rate by popularity-rank band (top 10M, 2026). Dead climbs 20× into the long tail; blocked peaks at the popular head.

That gradient reframes the headline. The 14% is real by domain count — but those dead domains are almost all in the part of the web nobody visits. 99.8% of dead domains sit below rank 100,000, and the popular top-100K — where the overwhelming majority of web traffic lives — is only 2.2% dead. Weighted by attention instead of raw count, the dead web nearly disappears:

  • By domain count: 14.2% (share of the top 10M domains that are dead)
  • Weighted by traffic: ~3% (the popular top-100K, where most web traffic is, is only 2.2% dead)

The dead web concentrates in the unvisited tail — 99.8% of dead domains sit below rank 100K. Traffic weighting is estimated from the rank distribution.
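Both the 2.2% and the 99.8% follow from the rank-band rates reported earlier. The check below is a sketch of that arithmetic; it assumes the three head bands are contiguous and hold exactly 1,000, 9,000, and 90,000 domains, and it uses the rounded per-band rates, so the results are approximate.

```python
# (band size in domains, dead rate) for the three bands inside the top 100K
bands = [
    (1_000, 0.008),    # Top 1K
    (9_000, 0.012),    # 1K-10K
    (90_000, 0.023),   # 10K-100K
]

# Estimated number of dead domains in the top 100K
dead_in_head = sum(size * rate for size, rate in bands)
head_rate = dead_in_head / sum(size for size, _ in bands)

# Share of all dead domains that sit below rank 100K
total_dead = 1_414_788
tail_share = 1 - dead_in_head / total_dead

print(f"top-100K dead rate: {head_rate:.1%}")
print(f"dead domains below rank 100K: {tail_share:.1%}")
```

This gives roughly 2,186 dead domains in the top 100K, which is 2.2% of the head and leaves 99.8% of all dead domains in the tail.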

"Dead web" is not "link rot" — and definitely not "dead internet theory"

Three different things get blurred together. Keeping them separate is the whole point:

  • This study (dead domains): does the domain still resolve and answer? We find 14.2% of the top 10M do not.
  • Link rot (Pew, Ahrefs): are the links inside living pages still good? Pew Research found 25% of pages from 2013–2023 are gone and 38% of 2013 pages have vanished; Ahrefs found 66.5% of tracked links have rotted. Those measure decay within the living web — a complement to this, not the same number.
  • Dead internet theory: the claim that AI-generated content and bots have displaced human activity online. That is about what's on the living web, not whether domains are reachable. It is a separate conversation, and conflating it with link rot is how bad statistics spread.

If you only remember one distinction: link rot is about the pages that are still up; the dead web is about the domains that aren't.

What this means if you're building a scraper or a data pipeline

The practical takeaway is the 8.9% blocked slice, because it is the part most likely to break your project. When a request fails, the reason dictates the fix, and they are nothing alike:

  • A dead domain (no DNS, refused) will never answer. Retrying, rotating proxies, or switching to a browser does nothing. Drop it and move on.
  • A blocked domain is alive and reachable — it just refused your client. A matched browser TLS/JA3 fingerprint or a residential IP gets in where a datacenter bot gets a 403. This is a transport problem, not a dead site.

This isn't theoretical. Probing every domain a second time with a real Chrome TLS/JA3 fingerprint recovered ~72,000 of the ~890,000 sites the polite bot was blocked from — enough to pull the blocked rate from 8.9% down to 8.2%. Every one of those is a live site reachable with the right client, not a dead end.
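The "reason dictates the fix" rule can be sketched as a tiny escalation policy. This is an assumption-laden illustration, not a description of how Crawlora's unblocker works: the client names, their order in `LADDER`, and the `next_client` function are all hypothetical.

```python
from typing import Optional

# Hypothetical escalation ladder, cheapest client first.
LADDER = ["plain_request", "browser_tls_fingerprint", "residential_ip"]


def next_client(outcome: str, current: str) -> Optional[str]:
    """Return the next client to try, or None when retrying is pointless.

    outcome is one of "alive", "blocked", "redirect", "dead".
    """
    if outcome != "blocked":
        # dead: nothing will ever answer; alive: nothing left to fix
        return None
    step = LADDER.index(current) + 1
    return LADDER[step] if step < len(LADDER) else None
```

For example, `next_client("blocked", "plain_request")` moves up to the browser fingerprint, while `next_client("dead", "plain_request")` returns `None`, so no budget is spent on a domain that cannot answer.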

The blocked web is the web you actually want.

We cross-checked a sample of these results against Similarweb traffic, and the blocked sites are by far the valuable ones. The blocked domains in our top-ranked sample pull a median of roughly 150 million monthly visits — Reddit (4.4 billion), Canva (975 million), Quora (313 million), Claude.ai (952 million). The dead ones record under 5,000 visits each, and most register zero — a four-to-five-orders-of-magnitude gap. Sites run a wall precisely because their data is worth taking, so the 8.9% blocked slice isn't noise; it is the most valuable 8.9% of the web.

Naive crawlers can't tell these apart, so they either give up on reachable sites or burn a budget retrying gone ones. The cost-efficient pattern is to escalate only as far as a site forces you to — which is exactly how Crawlora's anti-bot unblocker works, and why it bills on success rather than per attempt. If you want to know which bucket a specific URL is in before you build, the free anti-bot checker tells you in about 30 seconds, and our companion Anti-Bot Adoption Index measures how much of the live web runs a wall at all.

Two more things the scan turned up

The web is a maze of redirects. Only 69% of domains serve their final page directly; 31% bounce through at least one redirect — and a stubborn sliver loops until our 10-hop cap. That is exactly why a crawler that doesn't follow redirects sees a web that looks half-broken.

  • Direct (0 hops): 69.4%
  • 1 redirect: 23.9%
  • 2 redirects: 5.2%
  • 3 redirects: 1.1%
  • 4+ redirects: 0.4%

Redirects before the final page (top 10M, 2026). 31% of the web is at least one hop deep.
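Following redirects with a hop cap is simple to sketch. The version below is an illustration under stated assumptions: `follow` and the `Fetch` callable are hypothetical names, `fetch` is assumed to return the status code plus an absolute `Location` target, and the loop check is an addition of this sketch. Only the 10-hop cap comes from the text.

```python
from typing import Callable, Optional, Tuple

MAX_HOPS = 10  # the hop cap mentioned above

# fetch(url) returns (status, location); location is None unless a 3xx.
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def follow(url: str, fetch: Fetch, max_hops: int = MAX_HOPS) -> Tuple[str, int, int]:
    """Follow redirects and return (final_url, final_status, hops)."""
    seen = {url}
    hops = 0
    while True:
        status, location = fetch(url)
        is_redirect = 300 <= status < 400 and location is not None
        # Stop on a real answer, at the hop cap, or when the chain loops.
        if not is_redirect or hops == max_hops or location in seen:
            return url, status, hops
        seen.add(location)
        url = location
        hops += 1
```

A chain that loops or hits the cap comes back with a 3xx as its final status, which is what the redirect label captures. Relative `Location` values would need resolving against the current URL before use.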

The dead web is stuck on HTTP. A decade into the HTTPS transition, the living web is ~78% encrypted — but dead and bot-blocked domains are barely half, abandoned before they ever got a certificate.

  • Alive: 78.4%
  • Redirect: 78.6%
  • Dead: 52.8%
  • Blocked: 47.5%

Share served over HTTPS by outcome (2026). The living web is ~78% encrypted; the dead and bot-walled web is barely half.

How we measured it

No magic — a deliberately simple, reproducible probe, run at 10-million scale.

The list. The full top 10 million domains (a DomCop/Tranco-style popularity ranking). We reached 9,992,781 of them, or 99.93% coverage.

The probe. Each domain is fetched HTTPS-first from a datacenter IP, following redirects, with a short timeout and a cross-resolver DNS retry before any "DNS failed" verdict. We never submit a form, solve a CAPTCHA, log in, or fetch anything behind a wall. Every domain is probed twice — once as an honest bot, and once as a browser-like client with a real Chrome TLS/JA3 fingerprint — so we can separate "nobody's home" from "the bot wasn't let in."

The classification. A final 2xx, or a served 404/5xx (the host answered), is alive. A 403/429 or anti-bot challenge is blocked. A 3xx we can't resolve is redirect. Only no DNS, a refused/reset connection, or nothing accepting a connection is dead. That single rule — a server that answers anything is up — is the entire difference between 14.2% and 27.6%.

Limits. This is homepage-level reachability from a datacenter vantage, so it is a lower bound on what is reachable: a domain that blocks a datacenter bot may open for a residential browser. It also says nothing about deep pages, which can be deader (or more defended) than the homepage. Snapshot: June 2026. The full per-domain dataset (every domain, every arm) is open, and the live, searchable version is the Dead-Web Index.

Reach the live web, not the dead one

14% of the top web is gone — but 9% is alive and just blocking your bot. Crawlora escalates from a plain request to a real browser fingerprint only as far as a site demands, and bills on success. Stop retrying dead domains and stop getting 403s from live ones.


Sources

  • Dead-Web Index — the full searchable dataset
  • Full per-domain data on GitHub (CC BY 4.0)
  • The scanner, open source
  • Pew Research — When Online Content Disappears (2024)
  • Ahrefs — Link Rot Study
  • Crawlora Anti-Bot Adoption Index — how much of the web runs a wall

Frequently asked questions

How many of the world's top websites are dead?

14.2% of the top 10 million domains are genuinely dead — about 1.41 million sites that no longer resolve in DNS or refuse every connection. That is far below the often-quoted 27.6%, which counted anti-bot blocks and answered errors as death.

What's the difference between a dead website and a blocked one?

A dead site never answers — no DNS record, or nothing accepts a TCP connection. A blocked site is alive and answering, it just refuses an automated client (a 403, 429, or anti-bot challenge). 8.9% of the top web — 891,672 sites — is blocked, not dead, a distinction naive crawlers miss.

Is the dead web the same as the dead internet theory?

No. The dead internet theory is a claim that AI-generated content and bots have replaced human activity on the living web. This study measures the opposite, concrete thing: how many domains have gone completely dark and unreachable — DNS gone, connection refused, server gone.

Why is this lower than the 27.6% dead-web figure?

Earlier top-10M crawls counted three non-dead things as dead: anti-bot 403/429 blocks, 404/5xx pages served by a live server, and domains a single flaky DNS resolver failed to look up. Classifying honestly — dead means genuinely unreachable — brings the real figure to 14.2%.

Which TLD has the most dead domains?

.cn has the highest death rate among common TLDs at 33%. Institutional TLDs like .gov (26%) and .edu (22%) also rank high — matching Pew Research's finding that government and reference pages suffer the worst link rot.

Why does a site look dead to a scraper but load fine in my browser?

Anti-bot systems serve a 403 or a challenge to a datacenter IP while letting a real browser through. A matched browser TLS/JA3 fingerprint reaches the site where a naive bot is blocked — which is why this study probes every domain twice, as a polite bot and as a browser-like client.


About the author

Tony Wang · Founder, Crawlora

Tony Wang is the founder of Crawlora and a senior software engineer with 9+ years across backend, cloud infrastructure, and large-scale web crawling — including distributed scrapers that have collected millions of profiles. He writes about web scraping, SERP and MCP APIs, and AI-agent data workflows.
