The WorldWideWeb Pages

About

This site is an attempt to index the <TITLE> and <META NAME="description"> of as many Web domains as possible and present them as a directory in the style of white or yellow pages.
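
For the curious, that extraction boils down to something like the Go sketch below, using the golang.org/x/net/html parser (illustrative only; not the site's actual code):

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

// titleAndDescription walks a parsed HTML tree and returns the first
// <title> text and the content of <meta name="description">.
func titleAndDescription(doc string) (title, desc string) {
    root, err := html.Parse(strings.NewReader(doc))
    if err != nil {
        return "", ""
    }
    var walk func(*html.Node)
    walk = func(n *html.Node) {
        if n.Type == html.ElementNode {
            switch n.Data {
            case "title":
                if title == "" && n.FirstChild != nil {
                    title = n.FirstChild.Data
                }
            case "meta":
                var name, content string
                for _, a := range n.Attr {
                    switch strings.ToLower(a.Key) {
                    case "name":
                        name = strings.ToLower(a.Val)
                    case "content":
                        content = a.Val
                    }
                }
                if desc == "" && name == "description" {
                    desc = content
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }
    walk(root)
    return title, desc
}

func main() {
    t, d := titleAndDescription(`<html><head><title>Example</title>
<meta name="description" content="An example page."></head><body></body></html>`)
    fmt.Printf("title=%q description=%q\n", t, d)
}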

The list of domains was sourced from the most recent Common Crawl domain-level web graph.

(Anticipated) FAQs

How?
I loaded up an sqlite3 file with all the domains above and wrote an indexer in Go to plow through them with as many goroutines as my $2.50/mo VPS would let me. (It's a little more nuanced than that; source code will be available some day after I clean things up a bit).
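
The shape of it is roughly the worker pool below (a sketch only, with made-up file and table names; the real indexer is messier):

package main

import (
    "database/sql"
    "log"
    "net/http"
    "sync"
    "time"

    _ "github.com/mattn/go-sqlite3" // assumed driver; not necessarily what the site uses
)

func main() {
    db, err := sql.Open("sqlite3", "domains.db") // hypothetical file/schema names throughout
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    jobs := make(chan string)
    var wg sync.WaitGroup
    client := &http.Client{Timeout: 10 * time.Second}

    const workers = 64 // however many the $2.50/mo VPS will tolerate
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for domain := range jobs {
                resp, err := client.Get("http://" + domain + "/")
                if err != nil {
                    continue
                }
                // ...parse out <title> and <meta name="description"> here,
                // then write the result back to sqlite...
                resp.Body.Close()
            }
        }()
    }

    rows, err := db.Query(`SELECT domain FROM domains`)
    if err != nil {
        log.Fatal(err)
    }
    for rows.Next() {
        var d string
        if err := rows.Scan(&d); err == nil {
            jobs <- d
        }
    }
    rows.Close()
    close(jobs)
    wg.Wait()
}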
The API is another Go program that serves everything from sqlite3, so it's about as performant as you'd expect for $2.50/mo.
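
Something like this minimal handler, with hypothetical route, table, and column names:

package main

import (
    "database/sql"
    "encoding/json"
    "log"
    "net/http"

    _ "github.com/mattn/go-sqlite3" // assumed driver
)

func main() {
    db, err := sql.Open("sqlite3", "index.db") // hypothetical file name
    if err != nil {
        log.Fatal(err)
    }

    // Hypothetical route: look up one domain's title and description.
    http.HandleFunc("/api/domain", func(w http.ResponseWriter, r *http.Request) {
        var title, desc string
        err := db.QueryRow(
            `SELECT title, description FROM pages WHERE domain = ?`, // table/columns assumed
            r.URL.Query().Get("name"),
        ).Scan(&title, &desc)
        if err != nil {
            http.Error(w, "not found", http.StatusNotFound)
            return
        }
        json.NewEncoder(w).Encode(map[string]string{"title": title, "description": desc})
    })

    // Listen locally; the reverse proxy in front handles the rest.
    log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}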
Caddy sits in front of that for static file hosting and routing requests to the API, and Cloudflare sits in front of that for caching.
Iframes? Tables? Old HTML tags? Why?
Aesthetics.
This isn't all domains; what about...
The Common Crawl list covers only six months' worth of crawling, from Sept '22 through Feb '23, and contains only ~88 million domains.
The alternatives include zone files from ICANN's Centralized Zone Data Service, which don't include ccTLD domains, and third-party sources, which cost money; I'm cheap. (see: $2.50/mo VPS)
Cloudflare has a constantly updated list of their top 1 million domains, which is nice, but I wanted more.
What about subdomains?
Common Crawl's host-level web graph was another option, since it includes non-paid subdomains, but I felt my indexer was wasting a lot of time on non-existent or spammy-looking, redundant sites.
In hindsight, filtering the host-level list against the private domains in the Public Suffix List would have been a better idea. Maybe after the next Common Crawl web graph drops.
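
For what it's worth, that filter is straightforward with golang.org/x/net/publicsuffix, which reports whether a host's public suffix comes from the ICANN section of the list or the private section. A rough sketch:

package main

import (
    "fmt"

    "golang.org/x/net/publicsuffix"
)

// keepHost reports whether a host's public suffix is ICANN-managed.
// Hosts matching the "private domains" section of the Public Suffix
// List (e.g. anything under blogspot.com) return false and get skipped.
func keepHost(host string) bool {
    _, icann := publicsuffix.PublicSuffix(host)
    return icann
}

func main() {
    for _, h := range []string{"example.com", "someblog.blogspot.com", "www.example.co.uk"} {
        fmt.Println(h, keepHost(h))
    }
}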
Why so many foreign domains?
It's the World Wide Web, not the Anglosphere Wide Web. I did try to capture the <HTML> "lang" attribute for use as a filter, but found it to be missing on a lot of sites and gibberish on others.
What's with all the mojibake?
I dunno. I try to use the declared charset and convert to UTF-8 when one is available, but as with the <HTML> "lang" attribute, there's some weird stuff out there.
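
The conversion itself is the straightforward part; golang.org/x/net/html/charset can pick an encoding from the Content-Type header and the first bytes of the body, along these lines (a sketch, not the indexer's exact code):

package main

import (
    "fmt"
    "io"
    "net/http"

    "golang.org/x/net/html/charset"
)

// fetchUTF8 downloads a page and returns its body decoded to UTF-8.
func fetchUTF8(url string) (string, error) {
    resp, err := http.Get(url)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    // charset.NewReader uses the Content-Type header and the first bytes
    // of the body to pick an encoding, then decodes to UTF-8.
    r, err := charset.NewReader(resp.Body, resp.Header.Get("Content-Type"))
    if err != nil {
        return "", err
    }
    b, err := io.ReadAll(r)
    return string(b), err
}

func main() {
    body, err := fetchUTF8("http://example.com/")
    if err != nil {
        fmt.Println("fetch failed:", err)
        return
    }
    fmt.Println(len(body), "bytes of UTF-8")
}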
What about searching?
It's a good idea, but I'd need a bit more space (and time) to generate a full-text index.
Imagine a search engine limited to the first 128 characters of the title and 512 characters of the description; SEO BTFO.
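
If it ever happens, it would probably be a sqlite FTS table over those truncated fields. A purely hypothetical sketch (nothing like this exists on the site yet; file and table names are made up):

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/mattn/go-sqlite3" // assumed driver; FTS5 support may need a build tag
)

func main() {
    db, err := sql.Open("sqlite3", "index.db") // hypothetical file and table names
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Index only the truncated fields, as described above.
    if _, err := db.Exec(`CREATE VIRTUAL TABLE IF NOT EXISTS pages_fts USING fts5(domain, title, description)`); err != nil {
        log.Fatal(err)
    }
    if _, err := db.Exec(`INSERT INTO pages_fts (domain, title, description)
        SELECT domain, substr(title, 1, 128), substr(description, 1, 512) FROM pages`); err != nil {
        log.Fatal(err)
    }

    // Query it.
    rows, err := db.Query(`SELECT domain FROM pages_fts WHERE pages_fts MATCH ? LIMIT 10`, "yellow pages")
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()
    for rows.Next() {
        var d string
        if err := rows.Scan(&d); err == nil {
            fmt.Println(d)
        }
    }
}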
How can I discover completely random domains?
Check out the firehose of sites as they're being indexed (when the indexer is running).