# EveryTab Architecture

## System Overview

EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:

1. **Scanning Phase**: A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
2. **Hosting Phase**: A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.

The scanning phase runs monthly (triggered by new Common Crawl releases), produces a static site, and then its infrastructure is torn down. The hosting phase runs indefinitely at minimal cost.

```
Common Crawl (S3)
       |
       v
[EC2 + DuckDB] ---> [RDS Postgres] ---> [EC2 + Go programs] ---> S3 (icons/)
                          |                      |                    |
                          |  (hosts, icons       |                    |
                          |   tables)            v                    v
                          |              [Bundle Generator] ---> S3 (tabs/*.json)
                          |                      |
                          v                      v
                 [Backup to homelab]      S3 (index.html)
                          |                      |
                          v                      v
                 [Tear down EC2, RDS]     [CloudFront CDN]
```

## AWS Infrastructure

All resources in **us-east-1**.

| Resource | Purpose | Lifecycle |
|----------|---------|-----------|
| EC2 (xlarge, compute-optimized) | Run pipeline stages | Scanning only |
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup then delete) |
| S3 `everytab-icons` | Raw downloaded favicons | Scanning only (backup then delete) |
| S3 `everytab-site` | Static site: index.html + tabs/*.json | Permanent |
| CloudFront | CDN for static site | Permanent |
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only |

### Steady-State (Hosting Only)

- S3 `everytab-site`: stores index.html + ~50K JSON bundle files (~60GB total)
- CloudFront distribution: serves the site with caching

### Scanning Phase (Temporary)

- EC2 instance: runs all processing (no persistent local storage needed beyond OS)
- RDS: structured data store during pipeline execution
- S3 `everytab-icons`: temporary storage for downloaded favicons

## Data Model

### `hosts` table

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PRIMARY KEY | Internal ID |
| hostname | TEXT NOT NULL | e.g., `example.com` |
| protocol | TEXT NOT NULL | `https` or `http` (prefer https) |
| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., `CC-MAIN-2026-05`) |
| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
| warc_record_length | INT NOT NULL | Length of WARC record |
| html_title | TEXT | Extracted from `<title>` tag |
| iframe_allowed | BOOLEAN | True if site allows framing (no X-Frame-Options/CSP restriction) |
| best_icon_id | INT REFERENCES icons(id) | FK to the chosen icon for bundling |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |

**Constraints:** UNIQUE(hostname): one row per domain, prefer https over http.

### `icons` table

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PRIMARY KEY | Internal ID |
| host_id | INT REFERENCES hosts(id) | FK to parent host |
| url | TEXT NOT NULL | Full URL to the icon |
| source | TEXT NOT NULL | `favicon_ico` or `link_rel` |
| content_type | TEXT | MIME type after download (image/png, image/x-icon, etc.) |
| width | INT | Decoded pixel width |
| height | INT | Decoded pixel height |
| s3_key | TEXT | Key in everytab-icons bucket |
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
| error | TEXT | Error message if failed |

**Indexes:**

- `idx_icons_scan_state` on (scan_state): for batch claiming work
- `idx_icons_host_id` on (host_id): for best-icon selection

### Bundle JSON format (`tabs/0001.json`)

```json
{
  "entries": [
    {
      "host": "example.com",
      "title": "Example Domain",
      "icon": "iVBORw0KGgo...",
      "icon_w": 32,
      "icon_h": 32,
      "iframe_ok": true
    }
  ]
}
```

Icons are stored inline as base64-encoded PNG. Each bundle targets ~1MB, yielding approximately 500-700 entries per bundle depending on icon sizes.
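
As a reading aid, the bundle format above can be mirrored by a pair of Go structs. This is a minimal sketch, not the project's actual code: the type names `Bundle` and `Entry` and the helper `ParseBundle` are hypothetical, and modelling `icon` as a nullable string is an assumption based on the no-icon case (`"icon": null`) described in the frontend section.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// Entry mirrors one element of "entries" in a bundle file.
// Icon is a pointer so that a JSON null (no-icon host) stays
// distinguishable from an empty string. (Assumed modelling choice.)
type Entry struct {
	Host     string  `json:"host"`
	Title    string  `json:"title"`
	Icon     *string `json:"icon"`
	IconW    int     `json:"icon_w"`
	IconH    int     `json:"icon_h"`
	IframeOK bool    `json:"iframe_ok"`
}

// Bundle mirrors the top-level object of a tabs/*.json file.
type Bundle struct {
	Entries []Entry `json:"entries"`
}

// ParseBundle (hypothetical helper) decodes a bundle and checks that
// every non-null icon is valid standard base64.
func ParseBundle(data []byte) (Bundle, error) {
	var b Bundle
	if err := json.Unmarshal(data, &b); err != nil {
		return Bundle{}, fmt.Errorf("decode bundle: %w", err)
	}
	for i, e := range b.Entries {
		if e.Icon == nil {
			continue // no-icon host: nothing to validate
		}
		if _, err := base64.StdEncoding.DecodeString(*e.Icon); err != nil {
			return Bundle{}, fmt.Errorf("entry %d (%s): bad icon: %w", i, e.Host, err)
		}
	}
	return b, nil
}

func main() {
	// "iVBORw0KGgo=" is the base64 of the 8-byte PNG signature only,
	// used here as a stand-in for a real icon.
	sample := []byte(`{"entries":[
	  {"host":"example.com","title":"Example Domain","icon":"iVBORw0KGgo=","icon_w":32,"icon_h":32,"iframe_ok":true},
	  {"host":"no-icon.example","title":"No Icon","icon":null,"iframe_ok":false}
	]}`)
	b, err := ParseBundle(sample)
	if err != nil {
		panic(err)
	}
	for _, e := range b.Entries {
		fmt.Printf("%s has_icon=%v iframe_ok=%v\n", e.Host, e.Icon != nil, e.IframeOK)
	}
}
```

Validating the base64 at parse time is optional; it is included here only because a bundle generator could use the same check as a cheap self-test before upload.
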
## Pipeline Stages

The pipeline is a series of manually-run scripts executed in order. Each stage is idempotent and resumable.

### Stage 1: CC-Index Query

**Tool:** DuckDB with httpfs extension (or local parquet if httpfs takes >1hr)

**Input:** Common Crawl columnar index (parquet files on CC's S3)

**Query logic:**

```sql
WHERE url_path = '/'
  AND content_mime_type = 'text/html'
  AND fetch_status = 200
  AND url_query IS NULL
  AND url_protocol IN ('http', 'https')
  AND url_port IN (80, 443)
```

**Deduplication:** Per hostname, prefer `https` over `http`. Result is one row per unique hostname.

**Output:** Populates `hosts` table in RDS (~30M rows for a full crawl).

**Stats emitted:** Total domains found, https vs http breakdown, duplicates removed.

### Stage 2: WARC Parsing

**Tool:** Custom Go program, highly concurrent

**Input:** `hosts` table rows where `parsed = FALSE`

**Process:**

1. Claim a batch of rows (set parsed = TRUE optimistically, or use a cursor)
2. For each row, make a byte-range GET request to Common Crawl's S3:
   - `Range: bytes={offset}-{offset+length-1}`
   - Target: `s3://commoncrawl/{warc_filename}`
3. Parse the WARC record to extract the HTTP response
4. Parse HTML defensively (handle malformed HTML, use a lenient parser):
   - Extract `<title>` tag content
   - Extract `<link rel="icon">` href values (filter to png/gif/ico, sizes 16-64px)
   - Check HTTP response headers for `X-Frame-Options` and CSP `frame-ancestors`
5. Insert a `/favicon.ico` entry into `icons` for every host (always attempt this)
6. Insert any qualifying `link rel="icon"` entries into `icons`
7. Update `hosts` row with `html_title`, `iframe_allowed`, `parsed = TRUE`

**Concurrency:** High: thousands of goroutines. S3 byte-range requests are the bottleneck; S3 handles 5,500+ GET/s per prefix and WARC files are spread across many prefixes.

**Error handling:** If HTML is unparseable, mark as parsed with NULL title. If WARC fetch fails, retry once then skip.
Log all errors with hostname for investigation.

**Stats emitted:** Rows processed, titles extracted, icons found (by type), iframe restrictions found, parse failures.

### Stage 3: DNS Resolution Setup

**Tool:** Unbound, installed and configured on EC2

**Configuration:**

- Recursive resolver (no forwarding to upstream)
- Listening on 127.0.0.1:53
- Aggressive caching enabled
- High min-TTL (e.g., 3600s) to maximize cache hits across similar domains
- Configured as system resolver in `/etc/resolv.conf`

This runs as a background service. There is no separate "DNS resolution stage": the Go icon downloader's HTTP requests transparently use Unbound via the OS resolver. Unbound handles recursive resolution and caching.

**Why:** Downloading 30M+ icons without a local recursive resolver would overwhelm upstream DNS providers and likely get us rate-limited. Unbound resolves from root servers directly, caches aggressively, and handles the load locally.

### Stage 4: Icon Download

**Tool:** Custom Go program, highly concurrent

**Input:** `icons` table rows where `scan_state = 'unscanned'`

**Process:**

1. Claim a batch of rows by setting `scan_state = 'in_progress'` on up to N `unscanned` rows and returning them (`UPDATE ... RETURNING *`). Postgres `UPDATE` has no `LIMIT` clause, so pick the batch in a subquery, e.g. `WHERE id IN (SELECT id FROM icons WHERE scan_state = 'unscanned' LIMIT N FOR UPDATE SKIP LOCKED)`.
2. For each icon URL:
   - Make HTTP(S) GET request (normal Go HTTP client, DNS goes through Unbound)
   - Enforce timeout (5s connect, 10s total)
   - Enforce max download size (512KB, generous for icons)
   - On success: validate it's an image (check magic bytes), decode to get dimensions
   - Upload raw bytes to S3 `everytab-icons/{hash}` (content-addressed)
   - Update `icons` row: s3_key, content_type, width, height, scan_state = 'completed'
   - On failure: scan_state = 'failed', error = reason

**Concurrency:** Maximize throughput with a goroutine pool of configurable size (start at 1000, tune based on memory/bandwidth). Use semaphore pattern for backpressure.

**Fast failure:** DNS errors, connection refused, and timeouts are all recorded as failures on the first occurrence (no retry for icons: if it's down, it's down).
This keeps the long tail short.

**Scaling to fleet:** If a single instance is insufficient:

- Multiple EC2 instances run the same binary
- Each claims work via the `scan_state` UPDATE (Postgres row-level locking prevents double-work)
- No coordination needed beyond the shared database

**Stats emitted:** Icons attempted, completed, failed (by error type: DNS, timeout, HTTP error, invalid image, too large), download rate (icons/sec), bytes downloaded.

### Stage 5: Best Icon Selection

**Tool:** SQL query or small script

**Process:** For each host, select the best icon from its completed icons:

1. Filter to standard sizes: 16x16, 32x32, 48x48, 64x64
2. Among those, pick the largest dimensions (prefer 64 > 48 > 32 > 16)
3. If no standard sizes found, pick the largest icon with dimensions <= 64px on both axes
4. If no icons at all, host gets a NULL best_icon_id (will use default in frontend)

```sql
UPDATE hosts h
SET best_icon_id = (
    SELECT i.id
    FROM icons i
    WHERE i.host_id = h.id
      AND i.scan_state = 'completed'
      -- never pick anything larger than 64px on either axis
      AND i.width <= 64
      AND i.height <= 64
    ORDER BY
      -- standard square sizes (16, 32, 48, 64) sort first
      (i.width = i.height AND i.width IN (16, 32, 48, 64)) DESC,
      -- then the largest candidate
      i.width DESC,
      i.height DESC
    LIMIT 1
);
```

**Stats emitted:** Hosts with icons, hosts without icons, icon size distribution.

### Stage 6: Bundle Generation

**Tool:** Custom Go program

**Input:** `hosts` table (joined with their best icon from S3)

**Process:**

1. Query all hosts where best_icon_id IS NOT NULL (or include no-icon hosts with a default flag)
2. Randomize the full result set (ORDER BY random() or shuffle in memory)
3. For each host:
   - Download its best icon from S3 `everytab-icons`
   - Decode the icon (ICO/GIF/PNG/etc.)
   - For ICO files: extract the largest embedded image at a standard size <= 64x64
   - Re-encode as PNG (optimized compression)
   - Base64-encode the PNG bytes
4. Chunk into groups of N entries (~500-700, tuned so each JSON is ~1MB)
5. Write each chunk as `tabs/{n}.json` to S3 `everytab-site`
6. Record total bundle count

**Output:**

- `tabs/0000.json` through `tabs/{M-1}.json` in S3
- Total bundle count M (used in frontend build)

**Stats emitted:** Total bundles created, total hosts included, total hosts excluded (no icon), average bundle size, total S3 storage used.

### Stage 7: Frontend Build

**Tool:** Script/template that produces `index.html`

**Process:**

1. Inject `TOTAL_BUNDLES` constant into the JS (baked at build time)
2. Minify if desired
3. Upload `index.html` to S3 `everytab-site` root

### Stage 8: CloudFront Invalidation

Invalidate `/*` on the CloudFront distribution so the new site is live.

### Stage 9: Backup & Teardown

**Process:**

1. Dump RDS database to local machine (homelab) using `pg_dump` over SSH tunnel or direct
2. Sync S3 `everytab-icons` to homelab storage with `aws s3 sync`
3. Confirm backups are complete
4. Delete RDS instance
5. Delete S3 `everytab-icons` bucket
6. Terminate EC2 instance

## Frontend Architecture

### Single-File Design

One `index.html` containing inline CSS and JS. No external dependencies, no framework.

Two HTTP requests per initial page load:

1. `GET /index.html` (HTML + CSS + JS, likely <50KB)
2. `GET /tabs/{random}.json` (~1MB, one bundle of ~500-700 tabs)

### Tab Rendering

- Tabs fill the viewport in rows, styled to mimic Firefox browser tabs (v1)
- Each row has a slight horizontal marquee animation (CSS) at varying speeds
- Tab density adapts to viewport width (responsive)
- Each tab shows: favicon (or blank for no-icon) + truncated title

### Interaction

- **Click tab (iframe_ok=true):** Opens an iframe overlay showing the actual site
- **Click tab (iframe_ok=false):** Opens site in a new tab (with external link indicator)
- **Close:** X button or click-away dismisses the iframe/overlay
- **Scroll down:** Triggers fetch of additional random bundles (infinite scroll)

### Randomization

- Seed: current UTC date (so everyone on the same day sees the same "shuffle", but it changes daily)
- Generate random bundle index in range [0, TOTAL_BUNDLES)
- Track fetched bundle IDs in a Set to avoid duplicates on scroll

### No-Icon Hosts

Hosts without a favicon are included in bundles with `"icon": null`. Frontend renders these Firefox-style: just the title text with no icon. This matches Firefox's behavior for tabs without favicons.

## Cost Estimate

### Scanning Phase (One-Time per Crawl)

| Item | Estimate |
|------|----------|
| EC2 c5.xlarge (~24-48hrs) | $8-16 |
| RDS db.t3.medium (~48hrs) | $3-5 |
| S3 icons storage (temporary, ~500GB) | $12 (prorated to days) |
| S3 GET requests (30M WARC reads) | $12 |
| Data transfer (icon downloads, ~500GB inbound) | $0 (inbound is free) |
| **Total** | **~$35-45** |

### Hosting Phase (Monthly Steady-State)

| Item | Estimate |
|------|----------|
| S3 storage (~60GB bundles) | $1.40 |
| CloudFront (free tier: 1TB/month, 10M requests) | $0* |
| S3 requests (via CloudFront origin pulls, cached) | ~$1-5 |
| **Total** | **~$3-10/month** |

*CloudFront free tier covers moderate traffic. Costs increase if the site goes viral, but that's a good problem to have.
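
The S3 line items above can be sanity-checked with a few lines of arithmetic. The sketch below assumes S3 Standard list prices for us-east-1 of $0.0004 per 1,000 GET requests and $0.023 per GB-month; these rates are assumptions, not figures stated in this document, and should be checked against current AWS pricing. The function names and the 30-day proration are likewise illustrative.

```go
package main

import "fmt"

// Assumed list prices (us-east-1, S3 Standard). Not taken from this document.
const (
	s3GetPricePer1000   = 0.0004 // USD per 1,000 GET requests
	s3StoragePerGBMonth = 0.023  // USD per GB-month
)

// GetRequestCost returns the USD cost of n S3 GET requests.
func GetRequestCost(n float64) float64 {
	return n / 1000 * s3GetPricePer1000
}

// StorageCost returns the USD cost of holding gb gigabytes for the
// given number of days, prorated against a 30-day month.
func StorageCost(gb, days float64) float64 {
	return gb * s3StoragePerGBMonth * days / 30
}

func main() {
	fmt.Printf("30M WARC reads:       $%.2f\n", GetRequestCost(30e6))
	fmt.Printf("60GB bundles, month:  $%.2f\n", StorageCost(60, 30))
	fmt.Printf("500GB icons, month:   $%.2f\n", StorageCost(500, 30))
	fmt.Printf("500GB icons, 2 days:  $%.2f\n", StorageCost(500, 2))
}
```

Under these assumed rates, 30M WARC reads come to about $12 and 60GB of bundles to about $1.38 per month, in line with the tables. A full month of 500GB icon storage would be about $11.50, so the prorated cost for a two-day scan is well under a dollar.
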
## Scaling Strategy

### Development (100K domains)

- Single EC2 instance
- All stages complete in minutes-to-hours
- Good for validating the full pipeline end-to-end

### Full Scan (30M domains)

- Single EC2 instance, high concurrency
- CC-Index query: <1hr
- WARC parsing: 2-6hrs (limited by S3 request rate)
- Icon download: 12-48hrs (limited by network + remote server response times)
- Bundle generation: 1-2hrs

### Fleet Scaling (if needed)

- Spin up N identical EC2 instances running the icon downloader
- All share the same RDS instance
- Work claiming via Postgres atomic UPDATEs (no coordinator needed)
- Roughly linear scaling: 4 instances = ~4x throughput, as long as the shared RDS instance is not the bottleneck

## Key Design Decisions

1. **Static-only hosting**: No servers running for the live site. Entire frontend is pre-built.
2. **Inline icons in bundles**: No per-icon requests. One bundle fetch gives you ~600 tabs to render.
3. **Unbound as system resolver**: Transparent to application code. Go HTTP client works normally; DNS just happens to resolve locally.
4. **Content-addressed icon storage**: S3 key is the content hash. Natural dedup at storage layer during scanning (but icons are duplicated across bundles for simplicity).
5. **Resumable pipeline**: Each stage uses database state (parsed, scan_state) to track progress. Crash and restart without re-doing completed work.
6. **PNG as universal icon format**: All icons converted to PNG for bundles regardless of source format. Lossless, compact for small raster images, and universally supported in browsers via data URIs.
7. **Date-seeded randomization**: Everyone visiting on the same day sees the same tab arrangement, creating a shared experience. Changes daily for freshness.
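
The date-seeded randomization in decision 7 can be illustrated with a small deterministic picker. This is a sketch in Go for consistency with the pipeline tooling; the real frontend would do the equivalent in its inline JS. The names `Picker`, `NewPicker`, `seedFor`, and `utcDate`, the choice of FNV-1a as the hash, and the figure of 50,000 bundles in `main` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"time"
)

// utcDate formats t as its UTC calendar date, the seed input described above.
func utcDate(t time.Time) string {
	return t.UTC().Format("2006-01-02")
}

// seedFor hashes a date string into a deterministic 64-bit seed.
// FNV-1a is an arbitrary choice; any stable hash would do.
func seedFor(date string) int64 {
	h := fnv.New64a()
	h.Write([]byte(date))
	return int64(h.Sum64())
}

// Picker hands out bundle indexes in [0, total) without repeats.
type Picker struct {
	rng   *rand.Rand
	total int
	seen  map[int]bool // plays the role of the frontend's Set of fetched IDs
}

// NewPicker builds a picker whose sequence depends only on date and total.
func NewPicker(date string, total int) *Picker {
	return &Picker{
		rng:   rand.New(rand.NewSource(seedFor(date))),
		total: total,
		seen:  make(map[int]bool),
	}
}

// Next returns an unseen bundle index, or ok=false once every
// bundle has been handed out.
func (p *Picker) Next() (idx int, ok bool) {
	if len(p.seen) >= p.total {
		return 0, false
	}
	for {
		idx = p.rng.Intn(p.total)
		if !p.seen[idx] {
			p.seen[idx] = true
			return idx, true
		}
	}
}

func main() {
	p := NewPicker(utcDate(time.Now()), 50000) // 50000 stands in for TOTAL_BUNDLES
	for i := 0; i < 3; i++ {
		idx, _ := p.Next()
		fmt.Printf("tabs/%04d.json\n", idx)
	}
}
```

Two visitors who construct the picker on the same UTC date get the same sequence of bundle indexes, which is the shared-experience property the decision describes. Rejection sampling against the seen set is simple and cheap while only a small fraction of bundles has been fetched; a full shuffle would be the alternative if a visitor were expected to exhaust most of them.
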