We Followed 173 Million Domains for 8 Years. ~40M Died.
Tony Wang · 11 min read
We tracked 172.9M domains across 80 Common Crawl archives (2018–2026). About 40M are dead, but counting every domain that vanishes from a crawl overstates the toll: about a third of the vanished domains are still alive.
There is a clean, satisfying way to measure the dead web, and it is wrong. Take a big list of domains from a few years ago, check which ones still show up today, and call the missing ones dead. Common Crawl makes this almost too easy: it has published a monthly snapshot of the web for years, so you can line up the archives and watch domains drop out.
We did exactly that — 80 monthly Common Crawl archives, from January 2018 to May 2026, covering 172,959,928 registered domains — and then we did the part everyone skips. We took the domains that had disappeared and asked them, live, whether they were actually gone. A third of them answered. The honest dead-web number is about 40 million domains, ~23% of the eight-year set — but only because we refused to count a domain as dead just because Common Crawl stopped seeing it.
Eight years of the web, sorted into four fates
Every domain in the panel gets one label from its full history across the 80 crawls:
- alive_present: it was still there in the most recent crawl.
- dead_candidate: it was established for a while, then went dark from a healthy state and never came back.
- dark_ambiguous: it went dark too, but the last thing we saw was a block or a server error, a murkier exit.
- intermittent: it only ever appeared sparsely, so its absence proves nothing.
Two slices need care before anyone reaches for a headline. The 38% intermittent is the honest disclaimer: Common Crawl is a sample, not a census, and its frontier shifts every month, so a domain that shows up in three scattered crawls and then never again has not necessarily died — it may simply have fallen out of the sample. We do not count those as deaths. That leaves the 36% that genuinely disappeared after a real run of presence (the dead_candidate and dark_ambiguous slices together, 62.9 million domains) — and even those, it turns out, are not all dead.
Disappearing isn't dying
Here is the trap, in one chart. If you stop at the Common Crawl archives and call every vanished domain dead, you get 36%. When you take a stratified 20,000-domain sample of those vanished domains and re-probe each one live — a fresh DNS lookup, a TCP connection, an HTTP request, in June 2026 — only 64.2% of them are actually unreachable. Project that back and the real dead-web figure is about 23%.
- Disappeared (36%): 62.9M domains that dropped out of the crawl after an established run of presence
- Actually dead (about 23%): ≈40M domains with no DNS or nothing accepting a connection in 2026
The gap is not noise; it is the whole methodology. A domain leaves a crawl for many reasons that have nothing to do with death: the crawler's budget moved on, the site started blocking robots, a redirect chain confused the fetcher, or the domain simply wasn't sampled that month. A disappearance is a hypothesis, not a finding. Calibration — going back and checking a representative sample against the live internet — is what turns "62.9 million domains went missing" into a defensible "about 40 million are actually dead."
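The arithmetic behind that conversion can be checked in a few lines. This is a minimal sketch using only the rounded figures quoted in this post, and it applies the single overall 64.2% rate; the study itself projects per stratum, so treat the result as a sanity check rather than the exact published calculation. The function name `project_dead` is made up for illustration.

```python
# Figures quoted in the article (the vanished count is rounded).
PANEL_DOMAINS = 172_959_928     # all domains in the 8-year panel
VANISHED_DOMAINS = 62_900_000   # dead_candidate + dark_ambiguous
SAMPLE_DEAD_RATE = 0.642        # share of the re-probed sample that was unreachable


def project_dead(vanished: int, dead_rate: float, panel: int) -> tuple[float, float]:
    """Return (estimated dead domains, their share of the whole panel)."""
    dead = vanished * dead_rate
    return dead, dead / panel


# Naive figure: every vanished domain counted as dead.
naive_share = VANISHED_DOMAINS / PANEL_DOMAINS

# Calibrated figure: scale the vanished count by the live dead rate.
dead, calibrated_share = project_dead(VANISHED_DOMAINS, SAMPLE_DEAD_RATE, PANEL_DOMAINS)

print(f"naive: {naive_share:.1%} of the panel")
print(f"calibrated: {dead / 1e6:.1f}M domains, {calibrated_share:.1%} of the panel")
```

With these inputs the naive share comes out a little over 36% and the calibrated share a little over 23%, which matches the two headline numbers.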
So what did the vanished domains turn out to be when we knocked on the door in 2026?
Of the re-probed sample, 64.2% were genuinely dead, 31% still resolved and answered, and 4.7% were alive but blocking bots. Nearly a third of the domains that fell out of Common Crawl are still alive: they resolve and serve a page right now. They left the crawl, not the web. That single fact is why headline dead-web numbers built on disappearance alone run high, and why ours is deliberately, defensibly lower.
Dead and blocked are different failures
The other thing a live re-probe buys you is the distinction naive crawls erase: dead (nobody home) versus blocked (home, but won't open the door for a bot). Both look like "couldn't fetch it" to a scraper, and they demand opposite responses — a dead domain will never answer no matter what you do, while a blocked one opens for the right client. The probability a vanished domain is truly dead depends heavily on the last healthy signal we saw before it went dark.
Read the bottom bar carefully, because it is the contrarian finding: a domain last seen blocking bots is slightly less likely than a coin flip to actually be dead (44.7%). More than half of those "disappeared while blocking" domains are still up; they just walled off the crawler and then dropped out of the sample. A domain last seen healthy, by contrast, is over 70% likely to have genuinely died: a healthy site goes quiet because it was switched off, not because it started refusing robots. If you treat every failed fetch the same, you write off a pile of live, defended sites as corpses.
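The dead / blocked / alive distinction described above can be sketched with the Python standard library. This is an illustration, not the tool's actual code: the names `classify` and `probe` are hypothetical, only port 80 is tried, only the 403 and 429 statuses count as a block (an anti-bot challenge page served with a 200 would read as alive), and a connection that is accepted but never completes an HTTP exchange is treated as alive. All of those are assumptions.

```python
import http.client
import socket
import urllib.error
import urllib.request
from typing import Optional

BLOCK_STATUSES = {403, 429}  # statuses treated here as a bot refusal


def classify(resolved: bool, connected: bool, status: Optional[int]) -> str:
    """Map raw probe results to 'dead', 'blocked', or 'alive'."""
    if not resolved or not connected:
        return "dead"       # no DNS, or nothing accepting a connection
    if status in BLOCK_STATUSES:
        return "blocked"    # home, but refusing an automated client
    # Assumption: anything else that accepted the connection counts as alive,
    # including a failed HTTP exchange (status is None).
    return "alive"


def probe(domain: str, timeout: float = 5.0) -> str:
    """Probe one domain live: DNS lookup, then TCP connect, then HTTP GET."""
    try:
        socket.getaddrinfo(domain, 80)
    except (socket.gaierror, UnicodeError):
        return classify(False, False, None)
    try:
        socket.create_connection((domain, 80), timeout=timeout).close()
    except OSError:
        return classify(True, False, None)
    try:
        with urllib.request.urlopen(f"http://{domain}/", timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code   # 4xx/5xx answers still prove the host is up
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        status = None
    return classify(True, True, status)
```

Keeping `classify` separate from the network calls makes the decision rule easy to test offline; `probe("example.com")` would run the three steps against a live host.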
This is not "dead internet theory" — and not link rot
Three different questions get blended together whenever the web's decay comes up. Keeping them apart is the point:
- This study (dead domains): does the domain still resolve and answer at all? Across an 8-year, 173-million-domain panel, about 23% no longer do.
- Link rot (Pew, Ahrefs): are the links inside still-living pages good? Pew Research found 38% of pages from 2013 had vanished by 2023; Ahrefs found two-thirds of tracked links rot. That measures decay within the living web — a complement to this, not the same number.
- Dead internet theory: the claim that bots and AI-generated content have displaced humans on the living web. That is about what's on the pages that still load — a separate conversation entirely.
If you remember one line: link rot is about the pages that are still up; the dead web is about the domains that aren't; dead internet theory is about who's writing the pages that are. This study is strictly the middle one.
How we measured it
The method is deliberately simple and fully reproducible — and the open-source tool that runs it is the same one behind this post.
The corpus. Common Crawl publishes a columnar index for each monthly crawl, including a robotstxt subset — one row per host whose robots.txt it fetched. We read that subset for all 80 crawls from CC-MAIN-2018-05 to CC-MAIN-2026-21, rolled each up to the registered domain, and recorded, per crawl, whether the domain was present and what HTTP status it returned. The union is 172,959,928 domains.
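As a rough illustration of the roll-up, the sketch below builds a per-domain history from rows that have already been read from the index and reduced to a registered domain. The row shape, the `build_panel` name, and the rule for merging several hosts under one domain (keep the lowest status) are all assumptions made for the example, not details taken from the tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed row shape: (crawl_id, registered_domain, fetch_status), one per host.
Row = Tuple[str, str, int]


def build_panel(rows: Iterable[Row], crawls: List[str]) -> Dict[str, List[Optional[int]]]:
    """Return domain -> list of statuses, one slot per crawl (None = absent)."""
    slot = {crawl_id: i for i, crawl_id in enumerate(crawls)}
    panel: Dict[str, List[Optional[int]]] = defaultdict(lambda: [None] * len(crawls))
    for crawl_id, domain, status in rows:
        i = slot[crawl_id]
        current = panel[domain][i]
        # Assumption: when several hosts share a domain, keep the lowest status.
        if current is None or status < current:
            panel[domain][i] = status
    return dict(panel)


# Tiny illustrative input; the statuses are made up.
crawls = ["CC-MAIN-2018-05", "CC-MAIN-2026-21"]
rows = [
    ("CC-MAIN-2018-05", "example.com", 403),
    ("CC-MAIN-2018-05", "example.com", 200),
    ("CC-MAIN-2026-21", "example.org", 200),
]
panel = build_panel(rows, crawls)
```

A slot left as `None` is the "no row at all" case: the domain was simply not in that crawl's subset.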
Why disappearance, not status. Common Crawl's recorded fetch_status is HTTP-only — a domain that is genuinely dead produces no row at all, not a row with an error. So death cannot be read from a status code; it has to be inferred from a domain dropping out across consecutive crawls. That inference is exactly what makes calibration mandatory.
The labels. For each domain we compute its first and last crawl, how many of the 80 it appeared in, its longest run of recent absence, and the last state we saw. A domain that was present for several crawls and then absent from the most recent ones — from a healthy state — is a dead_candidate; from a blocked or erroring state, dark_ambiguous; sparsely-seen domains are intermittent and never counted as deaths.
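The labelling rule can be sketched as a small function over one domain's history. The post does not give the exact cut-offs behind "several crawls" and "the most recent ones", so `MIN_PRESENT` and `MIN_ABSENT_TAIL` below are hypothetical values, as is the bucketing of statuses into healthy, blocked, and error.

```python
from typing import Optional, Sequence

# Hypothetical thresholds; the article does not state the real ones.
MIN_PRESENT = 5       # crawls a domain must appear in to count as established
MIN_ABSENT_TAIL = 3   # consecutive most-recent crawls it must be missing from


def last_state(status: int) -> str:
    """Bucket an HTTP status (assumed bucketing, for illustration)."""
    if status in (403, 429):
        return "blocked"
    if status >= 500:
        return "error"
    return "healthy"   # includes 404: having no robots.txt is normal


def label(history: Sequence[Optional[int]]) -> str:
    """history[i] is the status seen in crawl i, or None if the domain was absent."""
    if history and history[-1] is not None:
        return "alive_present"
    seen = [s for s in history if s is not None]
    # Count how many of the most recent crawls the domain is missing from.
    tail = 0
    for s in reversed(history):
        if s is not None:
            break
        tail += 1
    if len(seen) < MIN_PRESENT or tail < MIN_ABSENT_TAIL:
        return "intermittent"
    return "dead_candidate" if last_state(seen[-1]) == "healthy" else "dark_ambiguous"
```

For example, `label([200] * 6 + [None] * 4)` yields `dead_candidate`, while the same history ending in a 403 before the gap yields `dark_ambiguous`.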
The calibration. We drew a stratified 20,000-domain sample of the vanished domains and re-probed each one live in June 2026 — DNS resolution, a TCP connection, and an HTTP request — classifying it as dead, alive, or blocked. Per-stratum dead rates (with Wilson 95% confidence intervals) convert the raw "this many disappeared" counts into the calibrated "this many are truly dead." That step is the difference between 36% and 23%.
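For readers who want to reproduce the calibration step, here is a minimal sketch of a Wilson 95% interval and a per-stratum projection. The formula is the standard Wilson score interval; the function names, the stratum names, and the counts in the example are invented for illustration and are not the study's data.

```python
import math
from typing import Dict, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def project(strata: Dict[str, Tuple[int, int, int]]) -> float:
    """strata: name -> (population, sampled, dead_in_sample). Returns estimated dead count."""
    return sum(pop * dead / sampled for pop, sampled, dead in strata.values())


# Illustrative counts only, NOT the study's figures.
example_strata = {
    "stratum_a": (1000, 100, 70),
    "stratum_b": (500, 50, 20),
}
estimated_dead = project(example_strata)
low, high = wilson_interval(50, 100)
```

Each stratum's sampled dead rate is scaled up to that stratum's full population, which is where the assumption that the sample's per-stratum rates hold across the whole stratum enters.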
Limits, honestly. Common Crawl is a sample of the web, not a census, and its coverage drifts over eight years — which is precisely why we calibrate rather than trust disappearance, but it also means the panel is biased toward domains Common Crawl chose to crawl, and a domain it never fetched a robots.txt for never enters the set. The live re-probe is a single datacenter vantage in mid-2026, so a domain that blocks datacenter IPs can read as blocked when a residential browser would get in, and a parked domain that serves a 200 reads as alive. The dead-rate projection assumes the 20k sample's per-stratum rates hold across each full stratum. Snapshot: the panel ends May 2026; the re-probe ran June 2026.
The full per-domain panel (all 172.9 million rows, every label and field) is open under CC BY 4.0, and the CLI that built it is open source. If you want the live, interactive view of today's web instead of the 8-year history, the companion Dead-Web Index and the snapshot study "14% of the Web Is Actually Dead" measure the same thing on the current top-ranked web.
Tell a dead domain from a blocked one — automatically
About a third of the 'dead' web isn't dead; it's alive, and some of it is refusing bots. Crawlora escalates from a plain request to a real browser fingerprint only as far as a site demands, and bills on success, so you stop retrying gone domains and stop getting 403s from live ones.
Frequently asked questions
How much of the web has died since 2018?
About 23%. We built an 8-year reachability panel of 172,959,928 domains from 80 monthly Common Crawl archives (2018–2026); after calibration, roughly 40 million are genuinely dead, meaning no DNS or nothing accepting a connection. That figure is much lower than the 36% you get by counting every domain that simply stopped appearing in the crawl, because disappearing from a sample of the web is not the same as leaving it.
Does disappearing from Common Crawl mean a domain is dead?
No. Common Crawl is a moving sample of the web, not a registry, so a domain can drop out because the crawler's budget shifted, it started blocking robots, or it simply wasn't sampled — not because it died. When we re-probed a stratified 20,000-domain sample of vanished domains live in 2026, only 64.2% were genuinely dead; 31% still resolved and answered, and 4.7% were alive but blocking bots.
What's the difference between a dead domain and a blocked one?
A dead domain never answers — no DNS record, or nothing accepts a TCP connection. A blocked domain is alive and answering; it just refuses an automated client with a 403, 429, or anti-bot challenge. They demand opposite responses (a dead domain never answers no matter what; a blocked one opens for the right client), and a domain last seen blocking bots is only 44.7% likely to be dead, versus 70.7% for one last seen healthy.
Is this the same as the dead internet theory?
No. The dead internet theory is the claim that bots and AI-generated content have displaced humans on the living web — a question about what's on the pages that still load. This study measures something concrete and different: whether the domain itself still resolves and answers at all. It is also distinct from link rot, which measures broken links inside still-living pages.
How was the dead-web Common Crawl dataset built?
From Common Crawl's monthly robotstxt index subset across 80 crawls (CC-MAIN-2018-05 to CC-MAIN-2026-21), rolled up to one row per registered domain with its presence and HTTP-status history. Because Common Crawl's status is HTTP-only, a dead host produces no row at all, so death is inferred from disappearance and then calibrated against a live re-probe. The full 172.9M-row panel is open under CC BY 4.0 and the crawlora-deadweb CLI that built it is open source.