The WorldWideWeb Pages
About
This site is an attempt to index the <TITLE> and <META NAME="description"> of as many Web domains as possible and present them as a directory in the style of white or yellow pages.
The list of domains was sourced from the most recent Common Crawl domain-level web graph.
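For a concrete picture of what "indexing the <TITLE> and <META NAME="description">" means, here's a minimal sketch using the golang.org/x/net/html tokenizer. It's illustrative only, not this site's actual code, and the example URL is a placeholder.

```go
// Sketch: pull <TITLE> and <META NAME="description"> out of a page.
// Uses the golang.org/x/net/html tokenizer; not the site's real indexer.
package main

import (
	"fmt"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

func titleAndDescription(resp *http.Response) (title, desc string) {
	z := html.NewTokenizer(resp.Body)
	for {
		switch z.Next() {
		case html.ErrorToken:
			return // EOF or parse error; return whatever we found
		case html.StartTagToken, html.SelfClosingTagToken:
			t := z.Token()
			switch t.Data {
			case "title":
				if z.Next() == html.TextToken {
					title = strings.TrimSpace(z.Token().Data)
				}
			case "meta":
				var name, content string
				for _, a := range t.Attr {
					switch strings.ToLower(a.Key) {
					case "name":
						name = strings.ToLower(a.Val)
					case "content":
						content = a.Val
					}
				}
				if name == "description" {
					desc = strings.TrimSpace(content)
				}
			}
		}
	}
}

func main() {
	resp, err := http.Get("https://example.com/") // placeholder URL
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	title, desc := titleAndDescription(resp)
	fmt.Printf("title=%q description=%q\n", title, desc)
}
```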
(Anticipated) FAQs
- How?
-
I loaded up an sqlite3 file with all the domains above and wrote an indexer in Go to plow through them with as many goroutines as my $2.50/mo VPS would let me; a rough sketch of that worker-pool shape follows this answer. (It's a little more nuanced than that; source code will be available someday after I clean things up a bit.)
The API is another Go program that serves everything from sqlite3, so it's about as performant as you'd expect for $2.50/mo.
Caddy sits in front of that for static file hosting and routing to the API, and Cloudflare sits in front of that for caching.
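A minimal sketch of the worker-pool shape, assuming a `domains` table and the mattn/go-sqlite3 driver; the table name, column name, and worker count are assumptions, not the site's actual schema or tuning:

```go
// Sketch of the goroutine worker pool: read domains from sqlite3,
// fetch each one, and hand the result off for indexing.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"net/http"
	"sync"
	"time"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "domains.db") // assumed filename
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	jobs := make(chan string)
	var wg sync.WaitGroup
	client := &http.Client{Timeout: 10 * time.Second}

	// As many workers as a $2.50/mo VPS will tolerate.
	const workers = 64
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for domain := range jobs {
				resp, err := client.Get("http://" + domain + "/")
				if err != nil {
					continue // dead domain; move on
				}
				resp.Body.Close()
				fmt.Println(domain, resp.StatusCode) // real code would parse and store
			}
		}()
	}

	rows, err := db.Query(`SELECT domain FROM domains`) // assumed schema
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var d string
		if err := rows.Scan(&d); err != nil {
			log.Fatal(err)
		}
		jobs <- d
	}
	close(jobs)
	wg.Wait()
}
```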
- Iframes? Tables? Old HTML tags? Why?
-
Aesthetics.
- This isn't all domains; what about...
-
The Common Crawl list covers only six months of crawling, from Sept '22 through Feb '23, and contains only ~88 million domains.
The alternatives include zone files from ICANN's Centralized Zone Data Service, which don't include ccTLD domains, and third-party sources, which cost money, and I'm cheap (see: $2.50/mo VPS).
Cloudflare has a constantly updated list of its top 1 million domains, which is nice, but I wanted more.
- What about subdomains?
-
Common Crawl's host-level web graph was another option, as it also contains non-paid subdomains, but I felt my indexer was wasting a lot of time on non-existent or spammy-looking, redundant sites.
In hindsight, filtering the host-level list against the private domains in the Public Suffix List would have been a better idea; a sketch of that filter follows this answer. Maybe after the next Common Crawl web graph drops.
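One reading of that filter, using golang.org/x/net/publicsuffix: hosts whose suffix comes from the PSL's private section (github.io, blogspot.com, ...) are distinct sites worth keeping whole, while everything else collapses to its registrable domain. The keep-vs-collapse policy here is my assumption, not a stated design.

```go
// Sketch: filter a host-level list against the Public Suffix List.
// Hosts under a *private-section* suffix are kept as-is; other hosts
// collapse to their registrable domain (eTLD+1). Real code would dedupe.
package main

import (
	"fmt"

	"golang.org/x/net/publicsuffix"
)

func keep(host string) (string, bool) {
	if _, icann := publicsuffix.PublicSuffix(host); !icann {
		return host, true // private suffix: the full host is the "site"
	}
	etld1, err := publicsuffix.EffectiveTLDPlusOne(host)
	if err != nil {
		return "", false // bare suffix or garbage; skip it
	}
	return etld1, true
}

func main() {
	for _, h := range []string{"someone.github.io", "mail.example.co.uk", "com"} {
		if site, ok := keep(h); ok {
			fmt.Println(h, "->", site)
		}
	}
}
```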
- Why so many foreign domains?
-
It's the World Wide Web, not the Anglosphere Wide Web. I did try to capture the <HTML> "lang" attribute for use as a filter, but found it to be missing on a lot of sites and gibberish on others.
- What's with all the mojibake?
-
I dunno. I try to use the provided charset and convert to UTF-8 when available (a sketch of that conversion follows this answer), but as with the <HTML> "lang" attribute, there's some weird stuff out there.
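That conversion is roughly what golang.org/x/net/html/charset does out of the box; a minimal sketch, with a placeholder URL:

```go
// Sketch: decode a response body to UTF-8 using the declared charset.
// charset.NewReader checks the Content-Type header and any <meta charset>
// near the top of the document, then falls back to sniffing; when the
// declaration itself is wrong, you still get mojibake.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"

	"golang.org/x/net/html/charset"
)

func main() {
	resp, err := http.Get("https://example.com/") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// NewReader wraps the body in a decoder for whatever charset it detects.
	r, err := charset.NewReader(resp.Body, resp.Header.Get("Content-Type"))
	if err != nil {
		log.Fatal(err)
	}
	utf8Body, err := io.ReadAll(r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(utf8Body), "bytes of UTF-8")
}
```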
- What about searching?
-
It's a good idea, but I'd need a bit more space (and time) to generate a full-text index; a sketch of what that could look like follows this answer.
Imagine a search engine limited to the first 128 characters of the title and 512 characters of the description; SEO BTFO.
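Given the stack, sqlite's FTS5 would be the obvious fit. A sketch, assuming the mattn/go-sqlite3 driver is built with FTS5 and a hypothetical `sites` table; the 128/512 truncation mirrors the idea above:

```go
// Sketch: a truncated full-text index in sqlite FTS5. Table and column
// names are hypothetical, not the site's actual schema.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // must be built with FTS5 enabled
)

func main() {
	db, err := sql.Open("sqlite3", "sites.db") // assumed filename
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Index only the first 128 chars of the title, 512 of the description.
	stmts := []string{
		`CREATE VIRTUAL TABLE IF NOT EXISTS site_fts USING fts5(domain, title, description)`,
		`INSERT INTO site_fts
		   SELECT domain, substr(title, 1, 128), substr(description, 1, 512) FROM sites`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}

	rows, err := db.Query(`SELECT domain FROM site_fts WHERE site_fts MATCH ? LIMIT 10`, "pages")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var d string
		if err := rows.Scan(&d); err != nil {
			log.Fatal(err)
		}
		fmt.Println(d)
	}
}
```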
- How can I discover completely random domains?
-
Check out the firehose of sites as they're being indexed (when the indexer is running).