The WorldWideWeb Pages
About
This site is an attempt to index the <TITLE> and <META NAME="description"> of as many Web domains as possible and present them as a directory in the style of white or yellow pages.
The list of domains was sourced from the most recent Common Crawl domain-level web graph.
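For a concrete picture of what "indexing the <TITLE> and <META NAME="description">" means, here's a minimal sketch using the golang.org/x/net/html tokenizer. It's illustrative only, not this site's actual code, and the example URL is a placeholder.

```go
// Sketch: pull <TITLE> and <META NAME="description"> out of a page.
// Uses the golang.org/x/net/html tokenizer; not the site's real indexer.
package main

import (
	"fmt"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

func titleAndDescription(resp *http.Response) (title, desc string) {
	z := html.NewTokenizer(resp.Body)
	for {
		switch z.Next() {
		case html.ErrorToken:
			return // EOF or parse error; return whatever we found
		case html.StartTagToken, html.SelfClosingTagToken:
			t := z.Token()
			switch t.Data {
			case "title":
				if z.Next() == html.TextToken {
					title = strings.TrimSpace(z.Token().Data)
				}
			case "meta":
				var name, content string
				for _, a := range t.Attr {
					switch strings.ToLower(a.Key) {
					case "name":
						name = strings.ToLower(a.Val)
					case "content":
						content = a.Val
					}
				}
				if name == "description" {
					desc = strings.TrimSpace(content)
				}
			}
		}
	}
}

func main() {
	resp, err := http.Get("https://example.com/") // placeholder URL
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	title, desc := titleAndDescription(resp)
	fmt.Printf("title=%q description=%q\n", title, desc)
}
```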
(Anticipated) FAQs
- How?
-
I loaded up an sqlite3 file with all the domains above and wrote an indexer in Go to plow through them with as many goroutines as my $2.50/mo VPS would let me; a rough sketch of that worker-pool shape follows this answer. (It's a little more nuanced than that; source code will be available someday after I clean things up a bit.)
The API is another Go program that serves everything from sqlite3, so it's about as performant as you'd expect for $2.50/mo.
Caddy sits in front of that for static file hosting and routing to the API, and Cloudflare sits in front of that for caching.
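A minimal sketch of the worker-pool shape, assuming a `domains` table and the mattn/go-sqlite3 driver; the table name, column name, and worker count are assumptions, not the site's actual schema or tuning:

```go
// Sketch of the goroutine worker pool: read domains from sqlite3,
// fetch each one, and hand the result off for indexing.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"net/http"
	"sync"
	"time"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "domains.db") // assumed filename
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	jobs := make(chan string)
	var wg sync.WaitGroup
	client := &http.Client{Timeout: 10 * time.Second}

	// As many workers as a $2.50/mo VPS will tolerate.
	const workers = 64
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for domain := range jobs {
				resp, err := client.Get("http://" + domain + "/")
				if err != nil {
					continue // dead domain; move on
				}
				resp.Body.Close()
				fmt.Println(domain, resp.StatusCode) // real code would parse and store
			}
		}()
	}

	rows, err := db.Query(`SELECT domain FROM domains`) // assumed schema
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var d string
		if err := rows.Scan(&d); err != nil {
			log.Fatal(err)
		}
		jobs <- d
	}
	close(jobs)
	wg.Wait()
}
```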
- Iframes? Tables? Old HTML tags? Why?
-
Aesthetics.
- This isn't all domains; what about...
-
The Common Crawl list covers only six months of crawling, from Sept '22 through Feb '23, and contains only ~88 million domains.
The alternatives include zone files from ICANN's Centralized Zone Data Service, which don't include ccTLD domains, and third-party sources, which cost money, and I'm cheap (see: $2.50/mo VPS).
Cloudflare has a constantly updated list of its top 1 million domains, which is nice, but I wanted more.
- What about subdomains?
-
Common Crawl's host-level web graph was another option, as it also contains non-paid subdomains, but I felt my indexer was wasting a lot of time on non-existent or spammy-looking, redundant sites.
In hindsight, filtering the host-level list against the private domains in the Public Suffix List would have been a better idea; a sketch of that filter follows this answer. Maybe after the next Common Crawl web graph drops.
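One reading of that filter, using golang.org/x/net/publicsuffix: hosts whose suffix comes from the PSL's private section (github.io, blogspot.com, ...) are distinct sites worth keeping whole, while everything else collapses to its registrable domain. The keep-vs-collapse policy here is my assumption, not a stated design.

```go
// Sketch: filter a host-level list against the Public Suffix List.
// Hosts under a *private-section* suffix are kept as-is; other hosts
// collapse to their registrable domain (eTLD+1). Real code would dedupe.
package main

import (
	"fmt"

	"golang.org/x/net/publicsuffix"
)

func keep(host string) (string, bool) {
	if _, icann := publicsuffix.PublicSuffix(host); !icann {
		return host, true // private suffix: the full host is the "site"
	}
	etld1, err := publicsuffix.EffectiveTLDPlusOne(host)
	if err != nil {
		return "", false // bare suffix or garbage; skip it
	}
	return etld1, true
}

func main() {
	for _, h := range []string{"someone.github.io", "mail.example.co.uk", "com"} {
		if site, ok := keep(h); ok {
			fmt.Println(h, "->", site)
		}
	}
}
```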
- Why so many foreign domains?
-
It's the World Wide Web, not the Anglosphere Wide Web. I did try to capture the <HTML> "lang" attribute for use as a filter, but found it to be missing on a lot of sites and gibberish on others.
- What's with all the mojibake?
-
I dunno. I try to use the provided charset and convert to UTF-8 when available (a sketch of that conversion follows this answer), but as with the <HTML> "lang" attribute, there's some weird stuff out there.
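That conversion is roughly what golang.org/x/net/html/charset does out of the box; a minimal sketch, with a placeholder URL:

```go
// Sketch: decode a response body to UTF-8 using the declared charset.
// charset.NewReader checks the Content-Type header and any <meta charset>
// near the top of the document, then falls back to sniffing; when the
// declaration itself is wrong, you still get mojibake.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"

	"golang.org/x/net/html/charset"
)

func main() {
	resp, err := http.Get("https://example.com/") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// NewReader wraps the body in a decoder for whatever charset it detects.
	r, err := charset.NewReader(resp.Body, resp.Header.Get("Content-Type"))
	if err != nil {
		log.Fatal(err)
	}
	utf8Body, err := io.ReadAll(r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(utf8Body), "bytes of UTF-8")
}
```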
- What about searching?
-
It's a good idea, but I'd need a bit more space (and time) to generate a full-text index; a sketch of what that could look like follows this answer.
Imagine a search engine limited to the first 128 characters of the title and 512 characters of the description; SEO BTFO.
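Given the stack, sqlite's FTS5 would be the obvious fit. A sketch, assuming the mattn/go-sqlite3 driver is built with FTS5 and a hypothetical `sites` table; the 128/512 truncation mirrors the idea above:

```go
// Sketch: a truncated full-text index in sqlite FTS5. Table and column
// names are hypothetical, not the site's actual schema.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // must be built with FTS5 enabled
)

func main() {
	db, err := sql.Open("sqlite3", "sites.db") // assumed filename
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Index only the first 128 chars of the title, 512 of the description.
	stmts := []string{
		`CREATE VIRTUAL TABLE IF NOT EXISTS site_fts USING fts5(domain, title, description)`,
		`INSERT INTO site_fts
		   SELECT domain, substr(title, 1, 128), substr(description, 1, 512) FROM sites`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}

	rows, err := db.Query(`SELECT domain FROM site_fts WHERE site_fts MATCH ? LIMIT 10`, "pages")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var d string
		if err := rows.Scan(&d); err != nil {
			log.Fatal(err)
		}
		fmt.Println(d)
	}
}
```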
- How can I discover completely random domains?
-
Check out the firehose of sites as they're being indexed (when the indexer is running).