EveryTab Architecture
System Overview
EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:
- Scanning Phase — A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
- Hosting Phase — A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.
The scanning phase runs monthly (triggered by new Common Crawl releases), produces a static site, and then its infrastructure is torn down. The hosting phase runs indefinitely at minimal cost.
Common Crawl (S3)
      |
      v
[EC2 + DuckDB] ---> [RDS Postgres] ---> [EC2 + Go programs] ---> S3 (icons/)
      |                   |                      |                    |
      |             (hosts, icons                |                    |
      |                tables)                   v                    v
      |                   |             [Bundle Generator] ---> S3 (tabs/*.json)
      |                   |                      |
      |                   v                      v
      |          [Backup to homelab]      S3 (index.html)
      |                                          |
      v                                          v
[Tear down EC2, RDS]                     [CloudFront CDN]
AWS Infrastructure
All resources in us-east-1.
| Resource | Purpose | Lifecycle |
|---|---|---|
| EC2 (xlarge, compute-optimized) | Run pipeline stages | Scanning only |
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup then delete) |
| S3 everytab-icons | Raw downloaded favicons | Scanning only (backup then delete) |
| S3 everytab-site | Static site: index.html + tabs/*.json | Permanent |
| CloudFront | CDN for static site | Permanent |
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only |
Steady-State (Hosting Only)
- S3 everytab-site: stores index.html + ~50K JSON bundle files (~60GB total)
- CloudFront distribution: serves the site with caching
Scanning Phase (Temporary)
- EC2 instance — runs all processing (no persistent local storage needed beyond OS)
- RDS — structured data store during pipeline execution
- S3 everytab-icons: temporary storage for downloaded favicons
Data Model
hosts table
| Column | Type | Description |
|---|---|---|
| id | SERIAL PRIMARY KEY | Internal ID |
| hostname | TEXT NOT NULL | e.g., example.com |
| protocol | TEXT NOT NULL | https or http (prefer https) |
| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., CC-MAIN-2026-05) |
| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
| warc_record_length | INT NOT NULL | Length of WARC record |
| html_title | TEXT | Extracted from <title> tag |
| iframe_allowed | BOOLEAN | True if site allows framing (no X-Frame-Options/CSP restriction) |
| best_icon_id | INT REFERENCES icons(id) | FK to the chosen icon for bundling |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |
Constraints: UNIQUE(hostname) — one row per domain, prefer https over http.
icons table
| Column | Type | Description |
|---|---|---|
| id | SERIAL PRIMARY KEY | Internal ID |
| host_id | INT REFERENCES hosts(id) | FK to parent host |
| url | TEXT NOT NULL | Full URL to the icon |
| source | TEXT NOT NULL | favicon_ico or link_rel |
| content_type | TEXT | MIME type after download (image/png, image/x-icon, etc.) |
| width | INT | Decoded pixel width |
| height | INT | Decoded pixel height |
| s3_key | TEXT | Key in everytab-icons bucket |
| scan_state | TEXT DEFAULT 'unscanned' | unscanned, in_progress, completed, failed |
| error | TEXT | Error message if failed |
Indexes:
- idx_icons_scan_state on (scan_state): for batch claiming work
- idx_icons_host_id on (host_id): for best-icon selection
Bundle JSON format (tabs/0001.json)
{
  "entries": [
    {
      "host": "example.com",
      "title": "Example Domain",
      "icon": "iVBORw0KGgo...",
      "icon_w": 32,
      "icon_h": 32,
      "iframe_ok": true
    }
  ]
}
Icons are stored inline as base64-encoded PNG. Each bundle targets ~1MB, yielding approximately 500-700 entries per bundle depending on icon sizes.
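A minimal Go sketch of types that would serialize to the bundle format above. The type and function names (Entry, Bundle, encodeBundle, decodeBundle, sampleBundle) are illustrative assumptions, not taken from the project. The icon field is a pointer so that a host without an icon marshals to "icon": null; what icon_w and icon_h hold in that case is not specified here, so the sketch leaves them at zero.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entry mirrors one element of "entries" in a tabs/*.json bundle.
type Entry struct {
	Host     string  `json:"host"`
	Title    string  `json:"title"`
	Icon     *string `json:"icon"` // base64 PNG; nil marshals to null (no-icon host)
	IconW    int     `json:"icon_w"`
	IconH    int     `json:"icon_h"`
	IframeOK bool    `json:"iframe_ok"`
}

// Bundle is the top-level JSON object of one bundle file.
type Bundle struct {
	Entries []Entry `json:"entries"`
}

// encodeBundle serializes a bundle to its JSON text.
func encodeBundle(b Bundle) (string, error) {
	raw, err := json.Marshal(b)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

// decodeBundle parses JSON text back into a Bundle.
func decodeBundle(s string) (Bundle, error) {
	var b Bundle
	err := json.Unmarshal([]byte(s), &b)
	return b, err
}

// sampleBundle builds one entry with an icon and one without.
// The icon string is the truncated placeholder from the format example.
func sampleBundle() Bundle {
	icon := "iVBORw0KGgo..."
	return Bundle{Entries: []Entry{
		{Host: "example.com", Title: "Example Domain", Icon: &icon, IconW: 32, IconH: 32, IframeOK: true},
		{Host: "no-icon.example", Title: "No Icon"},
	}}
}

func main() {
	out, err := encodeBundle(sampleBundle())
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```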
Pipeline Stages
The pipeline is a series of manually-run scripts executed in order. Each stage is idempotent and resumable.
Stage 1: CC-Index Query
Tool: DuckDB with httpfs extension (or local parquet if httpfs takes >1hr)
Input: Common Crawl columnar index (parquet files on CC's S3)
Query logic:
WHERE url_path = '/'
AND content_mime_type = 'text/html'
AND fetch_status = 200
AND url_query IS NULL
AND url_protocol IN ('http', 'https')
AND url_port IN (80, 443)
Deduplication: Per hostname, prefer https over http. Result is one row per unique hostname.
Output: Populates hosts table in RDS (~30M rows for a full crawl).
Stats emitted: Total domains found, https vs http breakdown, duplicates removed.
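The deduplication rule can be sketched in Go as below. In the pipeline this would more likely happen inside the DuckDB query itself; the IndexRow type and dedupeHosts function are hypothetical names used only to make the "one row per hostname, prefer https" rule concrete.

```go
package main

import "fmt"

// IndexRow is a hypothetical, trimmed-down view of one CC-Index result row.
type IndexRow struct {
	Hostname     string
	Protocol     string // "http" or "https"
	WarcFilename string
}

// dedupeHosts keeps one row per hostname. The first row seen wins,
// except that an https row replaces an earlier http row.
func dedupeHosts(rows []IndexRow) []IndexRow {
	position := make(map[string]int) // hostname -> index into out
	var out []IndexRow
	for _, r := range rows {
		i, seen := position[r.Hostname]
		if !seen {
			position[r.Hostname] = len(out)
			out = append(out, r)
			continue
		}
		if out[i].Protocol == "http" && r.Protocol == "https" {
			out[i] = r
		}
	}
	return out
}

func main() {
	rows := []IndexRow{
		{Hostname: "a.example", Protocol: "http", WarcFilename: "w1"},
		{Hostname: "a.example", Protocol: "https", WarcFilename: "w2"},
		{Hostname: "b.example", Protocol: "http", WarcFilename: "w3"},
	}
	for _, r := range dedupeHosts(rows) {
		fmt.Println(r.Hostname, r.Protocol, r.WarcFilename)
	}
}
```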
Stage 2: WARC Parsing
Tool: Custom Go program, highly concurrent
Input: hosts table rows where parsed = FALSE
Process:
- Claim a batch of rows (set parsed = TRUE optimistically, or use a cursor)
- For each row, make a byte-range GET request to Common Crawl's S3:
  - Range: bytes={offset}-{offset+length-1}
  - Target: s3://commoncrawl/{warc_filename}
- Parse the WARC record to extract the HTTP response
- Parse HTML defensively (handle malformed HTML, use a lenient parser):
  - Extract <title> tag content
  - Extract <link rel="icon"> href values (filter to png/gif/ico, sizes 16-64px)
  - Check HTTP response headers for X-Frame-Options and CSP frame-ancestors
- Insert a /favicon.ico entry into icons for every host (always attempt this)
- Insert any qualifying link rel="icon" entries into icons
- Update hosts row with html_title, iframe_allowed, parsed = TRUE
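The byte-range request from the steps above could be built roughly as follows. This is a sketch under assumptions: warcRange and newWARCRequest are hypothetical helpers, and the base URL and WARC filename in main are placeholders, not real Common Crawl paths. The last byte of the range is inclusive, hence the -1.

```go
package main

import (
	"fmt"
	"net/http"
)

// warcRange formats the Range header value for one WARC record.
// HTTP byte ranges are inclusive on both ends, so the last byte is
// offset+length-1.
func warcRange(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// newWARCRequest builds a ranged GET for one record. baseURL stands in
// for whichever HTTP endpoint fronts the commoncrawl bucket (assumption).
func newWARCRequest(baseURL, warcFilename string, offset, length int64) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/"+warcFilename, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", warcRange(offset, length))
	return req, nil
}

func main() {
	// Placeholder endpoint and filename, for illustration only.
	req, err := newWARCRequest("https://commoncrawl.example", "crawl-data/example.warc.gz", 1000, 250)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.String(), req.Header.Get("Range"))
}
```

A server that honors the range answers with 206 Partial Content. Common Crawl compresses each WARC record as its own gzip member, so the returned bytes can be decompressed on their own.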
Concurrency: High — thousands of goroutines. S3 byte-range requests are the bottleneck; S3 handles 5,500+ GET/s per prefix and WARC files are spread across many prefixes.
Error handling: If HTML is unparseable, mark as parsed with NULL title. If WARC fetch fails, retry once then skip. Log all errors with hostname for investigation.
Stats emitted: Rows processed, titles extracted, icons found (by type), iframe restrictions found, parse failures.
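One way the iframe_allowed check could look in Go is sketched below. It is deliberately conservative and rests on assumptions: any X-Frame-Options value is treated as a restriction, and a CSP frame-ancestors directive is treated as a restriction unless it lists the * wildcard. The function names iframeAllowed and headers are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// iframeAllowed reports whether the response headers leave framing open.
// Simplification (assumption): any X-Frame-Options value blocks framing,
// and frame-ancestors blocks framing unless it contains "*".
func iframeAllowed(h http.Header) bool {
	for _, v := range h.Values("X-Frame-Options") {
		if strings.TrimSpace(v) != "" {
			return false
		}
	}
	for _, policy := range h.Values("Content-Security-Policy") {
		for _, directive := range strings.Split(policy, ";") {
			fields := strings.Fields(strings.ToLower(directive))
			if len(fields) == 0 || fields[0] != "frame-ancestors" {
				continue
			}
			allowsAll := false
			for _, src := range fields[1:] {
				if src == "*" {
					allowsAll = true
				}
			}
			if !allowsAll {
				return false
			}
		}
	}
	return true
}

// headers builds an http.Header from alternating name/value arguments.
func headers(pairs ...string) http.Header {
	h := http.Header{}
	for i := 0; i+1 < len(pairs); i += 2 {
		h.Add(pairs[i], pairs[i+1])
	}
	return h
}

func main() {
	fmt.Println(iframeAllowed(headers("X-Frame-Options", "DENY")))
	fmt.Println(iframeAllowed(headers("Content-Security-Policy", "frame-ancestors 'none'")))
	fmt.Println(iframeAllowed(headers("Content-Type", "text/html")))
}
```

Erring toward false is the cheaper mistake here: a false negative only means the tab opens in a new browser tab instead of the overlay.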
Stage 3: DNS Resolution Setup
Tool: Unbound, installed and configured on EC2
Configuration:
- Recursive resolver (no forwarding to upstream)
- Listening on 127.0.0.1:53
- Aggressive caching enabled
- High min-TTL (e.g., 3600s) to maximize cache hits across similar domains
- Configured as system resolver in /etc/resolv.conf
This runs as a background service. No separate "DNS resolution stage" — the Go icon downloader's HTTP requests transparently use Unbound via the OS resolver. Unbound handles recursive resolution and caching.
Why: Downloading 30M+ icons without a local recursive resolver would overwhelm upstream DNS providers and likely get us rate-limited. Unbound resolves from root servers directly, caches aggressively, and handles the load locally.
Stage 4: Icon Download
Tool: Custom Go program, highly concurrent
Input: icons table rows where scan_state = 'unscanned'
Process:
- Claim a batch of rows. Postgres UPDATE has no LIMIT clause, so the batch is picked in a subquery: UPDATE icons SET scan_state = 'in_progress' WHERE id IN (SELECT id FROM icons WHERE scan_state = 'unscanned' LIMIT N FOR UPDATE SKIP LOCKED) RETURNING *
- For each icon URL:
  - Make HTTP(S) GET request (normal Go HTTP client, DNS goes through Unbound)
  - Enforce timeout (5s connect, 10s total)
  - Enforce max download size (512KB, generous for icons)
  - On success: validate it's an image (check magic bytes), decode to get dimensions
  - Upload raw bytes to S3 everytab-icons/{hash} (content-addressed)
  - Update icons row: s3_key, content_type, width, height, scan_state = 'completed'
  - On failure: scan_state = 'failed', error = reason
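The per-icon limits and checks above might be sketched with the Go standard library as follows. All function names are hypothetical, SHA-256 is an assumed choice for the content hash (the hash function is not named here), and the list of accepted MIME types is an assumption. Decoding dimensions is left out because ICO decoding is not in the standard library.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

const maxIconBytes = 512 * 1024 // 512KB download cap

var errTooLarge = errors.New("icon exceeds size cap")

// newIconClient applies the 5s connect and 10s total timeouts.
// The dial timeout covers the TCP connect only, not the TLS handshake.
func newIconClient() *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	return &http.Client{
		Timeout:   10 * time.Second,
		Transport: &http.Transport{DialContext: dialer.DialContext},
	}
}

// readCapped reads at most max bytes and fails if the body is larger.
// Reading max+1 bytes is what lets it tell "exactly max" from "too big".
func readCapped(r io.Reader, max int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, max+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > max {
		return nil, errTooLarge
	}
	return data, nil
}

// sniffImage checks magic bytes via http.DetectContentType and reports
// whether the payload looks like one of the accepted image formats.
func sniffImage(data []byte) (string, bool) {
	ct := http.DetectContentType(data)
	switch ct {
	case "image/png", "image/gif", "image/x-icon", "image/jpeg", "image/bmp", "image/webp":
		return ct, true
	}
	return ct, false
}

// iconKey derives the content-addressed S3 key (SHA-256 is an assumption).
func iconKey(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	pngHeader := []byte("\x89PNG\r\n\x1a\n")
	data, err := readCapped(bytes.NewReader(pngHeader), maxIconBytes)
	if err != nil {
		panic(err)
	}
	ct, ok := sniffImage(data)
	fmt.Println(ct, ok, iconKey(data))
	fmt.Println(newIconClient().Timeout)
}
```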
Concurrency: Maximize throughput — goroutine pool with configurable size (start at 1000, tune based on memory/bandwidth). Use semaphore pattern for backpressure.
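The semaphore pattern mentioned above can be sketched with a buffered channel: the producer blocks once the limit is reached, which is the backpressure. runBounded and peakConcurrency are hypothetical names, and the tasks here are trivial stand-ins for icon downloads.

```go
package main

import (
	"fmt"
	"sync"
)

// runBounded runs every task in its own goroutine but never lets more
// than limit of them be in flight at once.
func runBounded(limit int, tasks []func()) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, task := range tasks {
		sem <- struct{}{} // blocks while limit tasks are running
		wg.Add(1)
		go func(run func()) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when the task ends
			run()
		}(task)
	}
	wg.Wait()
}

// peakConcurrency runs n trivial tasks under the given limit and returns
// the highest number observed running at once plus the number completed.
func peakConcurrency(limit, n int) (peak, done int) {
	var mu sync.Mutex
	running := 0
	tasks := make([]func(), n)
	for i := range tasks {
		tasks[i] = func() {
			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			mu.Lock()
			running--
			done++
			mu.Unlock()
		}
	}
	runBounded(limit, tasks)
	return peak, done
}

func main() {
	peak, done := peakConcurrency(4, 100)
	fmt.Println("completed:", done, "peak within limit:", peak <= 4)
}
```

Because the claim loop blocks on the channel send, it also stops pulling new batches from the database while the pool is saturated.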
Fast failure: DNS errors, refused connections, and timeouts are recorded as failed on the first attempt, with no retry: if an icon host is down, it stays down for this crawl. This keeps the long tail short.
Scaling to fleet: If a single instance is insufficient:
- Multiple EC2 instances run the same binary
- Each claims work via the scan_state UPDATE (Postgres row-level locking prevents double-work)
- No coordination needed beyond the shared database
Stats emitted: Icons attempted, completed, failed (by error type: DNS, timeout, HTTP error, invalid image, too large), download rate (icons/sec), bytes downloaded.
Stage 5: Best Icon Selection
Tool: SQL query or small script
Process: For each host, select the best icon from its completed icons:
- Filter to standard sizes: 16x16, 32x32, 48x48, 64x64
- Among those, pick the largest dimensions (prefer 64 > 48 > 32 > 16)
- If no standard sizes found, pick the largest icon with dimensions <= 64px on both axes
- If no icons at all, host gets a NULL best_icon_id (will use default in frontend)
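UPDATE hosts h SET best_icon_id = (
    SELECT i.id FROM icons i
    WHERE i.host_id = h.id
      AND i.scan_state = 'completed'
      AND i.width <= 64 AND i.height <= 64  -- never pick an oversized icon
    ORDER BY
      (i.width = i.height AND i.width IN (16, 32, 48, 64)) DESC,  -- standard square sizes first
      i.width * i.height DESC                                      -- then the largest
    LIMIT 1
);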
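The same selection rule, written as a small Go function so the ordering is easy to test in isolation. The Icon type and the function names are hypothetical; the rule itself follows the bullets above: standard square sizes first, then the largest icon that fits within 64px on both axes.

```go
package main

import "fmt"

// Icon is a hypothetical, trimmed-down view of a completed icons row.
type Icon struct {
	ID     int
	Width  int
	Height int
}

// isStandard reports whether the icon is 16x16, 32x32, 48x48, or 64x64.
func isStandard(ic Icon) bool {
	if ic.Width != ic.Height {
		return false
	}
	switch ic.Width {
	case 16, 32, 48, 64:
		return true
	}
	return false
}

// better reports whether a should be preferred over b: standard sizes
// beat non-standard ones, and larger area breaks ties within a group.
func better(a, b Icon) bool {
	if isStandard(a) != isStandard(b) {
		return isStandard(a)
	}
	return a.Width*a.Height > b.Width*b.Height
}

// pickBestIcon returns the preferred icon, or false when no icon fits
// within 64px on both axes (the host then keeps a NULL best_icon_id).
func pickBestIcon(icons []Icon) (Icon, bool) {
	var best Icon
	found := false
	for _, ic := range icons {
		if ic.Width <= 0 || ic.Height <= 0 || ic.Width > 64 || ic.Height > 64 {
			continue
		}
		if !found || better(ic, best) {
			best, found = ic, true
		}
	}
	return best, found
}

func main() {
	icons := []Icon{{1, 16, 16}, {2, 60, 60}, {3, 32, 32}, {4, 128, 128}}
	best, ok := pickBestIcon(icons)
	fmt.Println(best.ID, ok)
}
```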
Stats emitted: Hosts with icons, hosts without icons, icon size distribution.
Stage 6: Bundle Generation
Tool: Custom Go program
Input: hosts table (joined with their best icon from S3)
Process:
- Query all hosts where best_icon_id IS NOT NULL (or include no-icon hosts with a default flag)
- Randomize the full result set (ORDER BY random() or shuffle in memory)
- For each host:
  - Download its best icon from S3 everytab-icons
  - Decode the icon (ICO/GIF/PNG/etc.)
  - For ICO files: extract the largest embedded image at a standard size <= 64x64
  - Re-encode as PNG (optimized compression)
  - Base64-encode the PNG bytes
- Chunk into groups of N entries (~500-700, tuned so each JSON is ~1MB)
- Write each chunk as tabs/{n}.json to S3 everytab-site
- Record total bundle count
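The re-encode and base64 steps could look like this in Go. It is a sketch: encodeIconPNG and solidIcon are hypothetical names, and the input is assumed to be an already-decoded image.Image. GIF and PNG decoders ship with the standard library, while ICO would need a third-party decoder or a custom parser.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"image"
	"image/color"
	"image/png"
)

// encodeIconPNG re-encodes a decoded icon as PNG at the highest
// compression level and returns the base64 text stored in the bundle.
func encodeIconPNG(img image.Image) (string, error) {
	var buf bytes.Buffer
	enc := png.Encoder{CompressionLevel: png.BestCompression}
	if err := enc.Encode(&buf, img); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// solidIcon builds a single-color square image to stand in for a
// decoded favicon.
func solidIcon(size int) image.Image {
	img := image.NewRGBA(image.Rect(0, 0, size, size))
	fill := color.RGBA{R: 0x33, G: 0x66, B: 0x99, A: 0xff}
	for y := 0; y < size; y++ {
		for x := 0; x < size; x++ {
			img.Set(x, y, fill)
		}
	}
	return img
}

func main() {
	encoded, err := encodeIconPNG(solidIcon(32))
	if err != nil {
		panic(err)
	}
	// Every PNG starts with the same 8-byte signature, so the base64
	// text always begins with the same characters.
	fmt.Println(encoded[:11], len(encoded))
}
```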
Output:
- tabs/0000.json through tabs/{M-1}.json in S3
- Total bundle count M (used in frontend build)
Stats emitted: Total bundles created, total hosts included, total hosts excluded (no icon), average bundle size, total S3 storage used.
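The chunking and naming steps of this stage can be sketched as below. chunkHosts and bundleKey are hypothetical names, and the four-digit zero padding is an assumption taken from the example filename. Note that %04d widens on its own past 9999, so with roughly 50K bundles the later keys would have five digits and the frontend would need to format indices the same way.

```go
package main

import "fmt"

// chunkHosts splits hosts into consecutive groups of at most size items.
// The final group may be shorter.
func chunkHosts(hosts []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for start := 0; start < len(hosts); start += size {
		end := start + size
		if end > len(hosts) {
			end = len(hosts)
		}
		out = append(out, hosts[start:end])
	}
	return out
}

// bundleKey returns the S3 key for bundle n, zero-indexed.
// Padding width is an assumption based on the example name.
func bundleKey(n int) string {
	return fmt.Sprintf("tabs/%04d.json", n)
}

func main() {
	hosts := []string{"a.example", "b.example", "c.example", "d.example", "e.example"}
	chunks := chunkHosts(hosts, 2)
	for i, c := range chunks {
		fmt.Println(bundleKey(i), c)
	}
	fmt.Println("TOTAL_BUNDLES =", len(chunks))
}
```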
Stage 7: Frontend Build
Tool: Script/template that produces index.html
Process:
- Inject TOTAL_BUNDLES constant into the JS (baked at build time)
- Minify if desired
- Upload index.html to S3 everytab-site root
Stage 8: CloudFront Invalidation
Invalidate /* on the CloudFront distribution so the new site is live.
Stage 9: Backup & Teardown
Process:
- Dump RDS database to local machine (homelab): pg_dump over SSH tunnel or direct
- Sync S3 everytab-icons to homelab storage: aws s3 sync
- Confirm backups are complete
- Delete RDS instance
- Delete S3 everytab-icons bucket
- Terminate EC2 instance
Frontend Architecture
Single-File Design
One index.html containing inline CSS and JS. No external dependencies, no framework. Two HTTP requests per initial page load:
- GET /index.html (HTML + CSS + JS, likely <50KB)
- GET /tabs/{random}.json (~1MB, one bundle of ~500-700 tabs)
Tab Rendering
- Tabs fill the viewport in rows, styled to mimic Firefox browser tabs (v1)
- Each row has a slight horizontal marquee animation (CSS) at varying speeds
- Tab density adapts to viewport width (responsive)
- Each tab shows: favicon (or blank for no-icon) + truncated title
Interaction
- Click tab (iframe_ok=true): Opens an iframe overlay showing the actual site
- Click tab (iframe_ok=false): Opens site in a new tab (with external link indicator)
- Close: X button or click-away dismisses the iframe/overlay
- Scroll down: Triggers fetch of additional random bundles (infinite scroll)
Randomization
- Seed: current UTC date (so everyone on the same day sees the same "shuffle", but it changes daily)
- Generate random bundle index in range [0, TOTAL_BUNDLES)
- Track fetched bundle IDs in a Set to avoid duplicates on scroll
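The randomization scheme is sketched here in Go to match the pipeline language; the real frontend would do the equivalent in inline JS, where Math.random cannot be seeded and a small seeded PRNG would be needed. The FNV hash for turning the date into a seed, and the names seedFromDate and bundleOrder, are assumptions for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// seedFromDate hashes a UTC date string such as "2026-02-01" into a
// PRNG seed, so every visitor on that date derives the same seed.
func seedFromDate(utcDate string) int64 {
	h := fnv.New64a()
	h.Write([]byte(utcDate))
	return int64(h.Sum64())
}

// bundleOrder returns up to n distinct bundle indices in
// [0, totalBundles), in the order a visitor would fetch them.
func bundleOrder(utcDate string, totalBundles, n int) []int {
	if totalBundles <= 0 || n <= 0 {
		return nil
	}
	if n > totalBundles {
		n = totalBundles
	}
	rng := rand.New(rand.NewSource(seedFromDate(utcDate)))
	fetched := make(map[int]bool) // plays the role of the frontend's Set
	order := make([]int, 0, n)
	for len(order) < n {
		idx := rng.Intn(totalBundles)
		if fetched[idx] {
			continue // already fetched on an earlier scroll
		}
		fetched[idx] = true
		order = append(order, idx)
	}
	return order
}

func main() {
	fmt.Println(bundleOrder("2026-02-01", 50000, 5))
}
```

Retrying on collisions is cheap while the number of fetched bundles is small relative to TOTAL_BUNDLES, which holds for any realistic scrolling session.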
No-Icon Hosts
Hosts without a favicon are included in bundles with "icon": null. The frontend renders these as title text with no icon, matching Firefox's behavior for tabs without favicons.
Cost Estimate
Scanning Phase (One-Time per Crawl)
| Item | Estimate |
|---|---|
| EC2 c5.xlarge (~24-48hrs) | $8-16 |
| RDS db.t3.medium (~48hrs) | $3-5 |
| S3 icons storage (temporary, ~500GB) | $12 (prorated to days) |
| S3 GET requests (30M WARC reads) | $12 |
| Data transfer (icon downloads, ~500GB inbound) | $0 (inbound is free) |
| Total | ~$35-45 |
Hosting Phase (Monthly Steady-State)
| Item | Estimate |
|---|---|
| S3 storage (~60GB bundles) | $1.40 |
| CloudFront (free tier: 1TB/month, 10M requests) | $0* |
| S3 requests (via CloudFront origin pulls, cached) | ~$1-5 |
| Total | ~$3-10/month |
*CloudFront free tier covers moderate traffic. Costs increase if the site goes viral, but that's a good problem to have.
Scaling Strategy
Development (100K domains)
- Single EC2 instance
- All stages complete in minutes-to-hours
- Good for validating the full pipeline end-to-end
Full Scan (30M domains)
- Single EC2 instance, high concurrency
- CC-Index query: <1hr
- WARC parsing: 2-6hrs (limited by S3 request rate)
- Icon download: 12-48hrs (limited by network + remote server response times)
- Bundle generation: 1-2hrs
Fleet Scaling (if needed)
- Spin up N identical EC2 instances running the icon downloader
- All share the same RDS instance
- Work claiming via Postgres atomic UPDATEs (no coordinator needed)
- Roughly linear scaling (4 instances = ~4x throughput) for as long as the shared RDS instance keeps up with claim and update traffic
Key Design Decisions
- Static-only hosting — No servers running for the live site. Entire frontend is pre-built.
- Inline icons in bundles — No per-icon requests. One bundle fetch gives you ~600 tabs to render.
- Unbound as system resolver — Transparent to application code. Go HTTP client works normally; DNS just happens to resolve locally.
- Content-addressed icon storage — S3 key is the content hash. Natural dedup at storage layer during scanning (but icons are duplicated across bundles for simplicity).
- Resumable pipeline — Each stage uses database state (parsed, scan_state) to track progress. Crash and restart without re-doing completed work.
- PNG as universal icon format — All icons converted to PNG for bundles regardless of source format. PNG is lossless, compact for small raster images, and universally supported in browsers via data URIs.
- Date-seeded randomization — Everyone visiting on the same day sees the same tab arrangement, creating a shared experience. Changes daily for freshness.