diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 0000000..4eeb3ac
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,360 @@
+# EveryTab Architecture
+
+## System Overview
+
+EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:
+
+1. **Scanning Phase** — A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
+2. **Hosting Phase** — A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.
+
+The scanning phase runs monthly (triggered by new Common Crawl releases), produces a static site, and then its infrastructure is torn down. The hosting phase runs indefinitely at minimal cost.
+
+```
+Common Crawl (S3)
+        |
+        v
+[EC2 + DuckDB] ---> [RDS Postgres] ---> [EC2 + Go programs] ---> S3 (icons/)
+        |                 |                      |                    |
+        |          (hosts, icons                 |                    |
+        |             tables)                    v                    v
+        |                 |              [Bundle Generator] ---> S3 (tabs/*.json)
+        |                 |                      |
+        |                 v                      v
+        |        [Backup to homelab]      S3 (index.html)
+        |                                        |
+        v                                        v
+[Tear down EC2, RDS]                     [CloudFront CDN]
+```
+
+## AWS Infrastructure
+
+All resources in **us-east-1**.
+
+| Resource | Purpose | Lifecycle |
+|----------|---------|-----------|
+| EC2 (xlarge, compute-optimized) | Run pipeline stages | Scanning only |
+| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup then delete) |
+| S3 `everytab-icons` | Raw downloaded favicons | Scanning only (backup then delete) |
+| S3 `everytab-site` | Static site: index.html + tabs/*.json | Permanent |
+| CloudFront | CDN for static site | Permanent |
+| Unbound (on EC2) | Local recursive DNS resolver | Scanning only |
+
+### Steady-State (Hosting Only)
+- S3 `everytab-site` — stores index.html + ~50K JSON bundle files (~60GB total)
+- CloudFront distribution — serves the site with caching
+
+### Scanning Phase (Temporary)
+- EC2 instance — runs all processing (no persistent local storage needed beyond OS)
+- RDS — structured data store during pipeline execution
+- S3 `everytab-icons` — temporary storage for downloaded favicons
+
+## Data Model
+
+### `hosts` table
+
+| Column | Type | Description |
+|--------|------|-------------|
+| id | SERIAL PRIMARY KEY | Internal ID |
+| hostname | TEXT NOT NULL | e.g., `example.com` |
+| protocol | TEXT NOT NULL | `https` or `http` (prefer https) |
+| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., `CC-MAIN-2026-05`) |
+| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
+| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
+| warc_record_length | INT NOT NULL | Length of WARC record |
+| html_title | TEXT | Extracted from `<title>` tag |
+| iframe_allowed | BOOLEAN | True if site allows framing (no X-Frame-Options/CSP restriction) |
+| best_icon_id | INT REFERENCES icons(id) | FK to the chosen icon for bundling |
+| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |
+
+**Constraints:** UNIQUE(hostname) — one row per domain, prefer https over http.
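
As a rough illustration of how the `iframe_allowed` column might be derived from response headers, here is a minimal Python sketch (the pipeline itself is described as Go). The function name and the deliberately conservative rules are assumptions: any `X-Frame-Options` value counts as a restriction, and a CSP `frame-ancestors` directive counts as a restriction unless it is a bare `*`.

```python
def iframe_allowed(headers: dict) -> bool:
    """Return True when no framing restriction is detected (conservative sketch)."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Assumption: any X-Frame-Options value (DENY, SAMEORIGIN, ...) is a restriction.
    if lowered.get("x-frame-options", "").strip():
        return False

    # Look for a frame-ancestors directive inside the CSP header.
    csp = lowered.get("content-security-policy", "")
    for directive in csp.split(";"):
        parts = directive.strip().split()
        if parts and parts[0].lower() == "frame-ancestors":
            # Assumption: only a bare wildcard means "anyone may frame this page".
            return parts[1:] == ["*"]
    return True
```

A real implementation would probably need to handle multiple CSP headers and scheme-only sources such as `https:`; this sketch errs on the side of reporting a restriction.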
+
+### `icons` table
+
+| Column | Type | Description |
+|--------|------|-------------|
+| id | SERIAL PRIMARY KEY | Internal ID |
+| host_id | INT REFERENCES hosts(id) | FK to parent host |
+| url | TEXT NOT NULL | Full URL to the icon |
+| source | TEXT NOT NULL | `favicon_ico` or `link_rel` |
+| content_type | TEXT | MIME type after download (image/png, image/x-icon, etc.) |
+| width | INT | Decoded pixel width |
+| height | INT | Decoded pixel height |
+| s3_key | TEXT | Key in everytab-icons bucket |
+| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
+| error | TEXT | Error message if failed |
+
+**Indexes:**
+- `idx_icons_scan_state` on (scan_state) — for batch claiming work
+- `idx_icons_host_id` on (host_id) — for best-icon selection
+
+### Bundle JSON format (`tabs/0001.json`)
+
+```json
+{
+  "entries": [
+    {
+      "host": "example.com",
+      "title": "Example Domain",
+      "icon": "iVBORw0KGgo...",
+      "icon_w": 32,
+      "icon_h": 32,
+      "iframe_ok": true
+    }
+  ]
+}
+```
+
+Icons are stored inline as base64-encoded PNG. Each bundle targets ~1MB, yielding approximately 500-700 entries per bundle depending on icon sizes.
+
+## Pipeline Stages
+
+The pipeline is a series of manually-run scripts executed in order. Each stage is idempotent and resumable.
+
+### Stage 1: CC-Index Query
+
+**Tool:** DuckDB with httpfs extension (or local parquet if httpfs takes >1hr)
+
+**Input:** Common Crawl columnar index (parquet files on CC's S3)
+
+**Query logic:**
+```sql
+WHERE url_path = '/'
+  AND content_mime_type = 'text/html'
+  AND fetch_status = 200
+  AND url_query IS NULL
+  AND url_protocol IN ('http', 'https')
+  AND url_port IN (80, 443)
+```
+
+**Deduplication:** Per hostname, prefer `https` over `http`. Result is one row per unique hostname.
+
+**Output:** Populates `hosts` table in RDS (~30M rows for a full crawl).
+
+**Stats emitted:** Total domains found, https vs http breakdown, duplicates removed.
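
The deduplication rule for Stage 1 (one row per hostname, `https` preferred over `http`) can be sketched in a few lines of Python. This is illustrative only: the function name is hypothetical, rows are assumed to be dicts keyed by the `hosts` column names, and in practice the same rule would more likely live in the DuckDB query itself.

```python
def dedupe_hosts(rows):
    """Keep one row per hostname, preferring https over http."""
    best = {}
    for row in rows:
        current = best.get(row["hostname"])
        # Take the first row seen for a hostname, and upgrade http to https later.
        if current is None or (
            current["protocol"] == "http" and row["protocol"] == "https"
        ):
            best[row["hostname"]] = row
    return list(best.values())
```

For example, given both `http://example.com/` and `https://example.com/`, only the `https` row survives, regardless of input order.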
+
+### Stage 2: WARC Parsing
+
+**Tool:** Custom Go program, highly concurrent
+
+**Input:** `hosts` table rows where `parsed = FALSE`
+
+**Process:**
+1. Claim a batch of rows (set parsed = TRUE optimistically, or use a cursor)
+2. For each row, make a byte-range GET request to Common Crawl's S3:
+   - `Range: bytes={offset}-{offset+length-1}`
+   - Target: `s3://commoncrawl/{warc_filename}`
+3. Parse the WARC record to extract the HTTP response
+4. Parse HTML (defensively — handle malformed HTML, use a lenient parser):
+   - Extract `<title>` tag content
+   - Extract `<link rel="icon">` href values (filter to png/gif/ico, sizes 16-64px)
+   - Check HTTP response headers for `X-Frame-Options` and CSP `frame-ancestors`
+5. Insert a `/favicon.ico` entry into `icons` for every host (always attempt this)
+6. Insert any qualifying `link rel="icon"` entries into `icons`
+7. Update `hosts` row with `html_title`, `iframe_allowed`, `parsed = TRUE`
+
+**Concurrency:** High — thousands of goroutines. S3 byte-range requests are the bottleneck; S3 handles 5,500+ GET/s per prefix and WARC files are spread across many prefixes.
+
+**Error handling:** If HTML is unparseable, mark as parsed with NULL title. If WARC fetch fails, retry once then skip. Log all errors with hostname for investigation.
+
+**Stats emitted:** Rows processed, titles extracted, icons found (by type), iframe restrictions found, parse failures.
+
+### Stage 3: DNS Resolution Setup
+
+**Tool:** Unbound, installed and configured on EC2
+
+**Configuration:**
+- Recursive resolver (no forwarding to upstream)
+- Listening on 127.0.0.1:53
+- Aggressive caching enabled
+- High min-TTL (e.g., 3600s) to maximize cache hits across similar domains
+- Configured as system resolver in `/etc/resolv.conf`
+
+This runs as a background service. No separate "DNS resolution stage" — the Go icon downloader's HTTP requests transparently use Unbound via the OS resolver. Unbound handles recursive resolution and caching.
+
+**Why:** Downloading 30M+ icons without a local recursive resolver would overwhelm upstream DNS providers and likely get us rate-limited. Unbound resolves from root servers directly, caches aggressively, and handles the load locally.
+
+### Stage 4: Icon Download
+
+**Tool:** Custom Go program, highly concurrent
+
+**Input:** `icons` table rows where `scan_state = 'unscanned'`
+
+**Process:**
+1. Claim a batch of rows atomically. Postgres `UPDATE` has no `LIMIT` clause, so the limit goes in a locking subquery: `UPDATE icons SET scan_state = 'in_progress' WHERE id IN (SELECT id FROM icons WHERE scan_state = 'unscanned' LIMIT N FOR UPDATE SKIP LOCKED) RETURNING *`
+2. For each icon URL:
+   - Make HTTP(S) GET request (normal Go HTTP client, DNS goes through Unbound)
+   - Enforce timeout (5s connect, 10s total)
+   - Enforce max download size (512KB — generous for icons)
+   - On success: validate it's an image (check magic bytes), decode to get dimensions
+   - Upload raw bytes to S3 `everytab-icons/{hash}` (content-addressed)
+   - Update `icons` row: s3_key, content_type, width, height, scan_state = 'completed'
+   - On failure: scan_state = 'failed', error = reason
+
+**Concurrency:** Maximize throughput — goroutine pool with configurable size (start at 1000, tune based on memory/bandwidth). Use semaphore pattern for backpressure.
+
+**Fast failure:** DNS errors, connection refused, timeouts all fail immediately (no retry for icons — if it's down, it's down). This keeps the long tail short.
+
+**Scaling to fleet:** If a single instance is insufficient:
+- Multiple EC2 instances run the same binary
+- Each claims work via the `scan_state` UPDATE (Postgres row-level locking prevents double-work)
+- No coordination needed beyond the shared database
+
+**Stats emitted:** Icons attempted, completed, failed (by error type: DNS, timeout, HTTP error, invalid image, too large), download rate (icons/sec), bytes downloaded.
+
+### Stage 5: Best Icon Selection
+
+**Tool:** SQL query or small script
+
+**Process:**
+For each host, select the best icon from its completed icons:
+1. Filter to standard sizes: 16x16, 32x32, 48x48, 64x64
+2.
Among those, pick the largest dimensions (prefer 64 > 48 > 32 > 16)
+3. If no standard sizes found, pick the largest icon with dimensions <= 64px on both axes
+4. If no icons at all, host gets a NULL best_icon_id (will use default in frontend)
+
+```sql
+-- Pick one icon per host: standard square sizes first, then largest <= 64px.
+UPDATE hosts h SET best_icon_id = (
+  SELECT i.id FROM icons i
+  WHERE i.host_id = h.id
+    AND i.scan_state = 'completed'
+    AND i.width <= 64 AND i.height <= 64   -- rule 3: never exceed 64px on either axis
+  ORDER BY
+    (i.width = i.height AND i.width IN (16, 32, 48, 64)) DESC,  -- rules 1-2: square standard sizes win
+    i.width DESC,
+    i.height DESC
+  LIMIT 1
+);
+```
+
+**Stats emitted:** Hosts with icons, hosts without icons, icon size distribution.
+
+### Stage 6: Bundle Generation
+
+**Tool:** Custom Go program
+
+**Input:** `hosts` table (joined with their best icon from S3)
+
+**Process:**
+1. Query all hosts where best_icon_id IS NOT NULL (or include no-icon hosts with a default flag)
+2. Randomize the full result set (ORDER BY random() or shuffle in memory)
+3. For each host:
+   - Download its best icon from S3 `everytab-icons`
+   - Decode the icon (ICO/GIF/PNG/etc.)
+   - For ICO files: extract the largest embedded image at a standard size <= 64x64
+   - Re-encode as PNG (optimized compression)
+   - Base64-encode the PNG bytes
+4. Chunk into groups of N entries (~500-700, tuned so each JSON is ~1MB)
+5. Write each chunk as `tabs/{n}.json` to S3 `everytab-site`
+6. Record total bundle count
+
+**Output:**
+- `tabs/0000.json` through `tabs/{M-1}.json` in S3 (zero-based index)
+- Total bundle count M (used in frontend build)
+
+**Stats emitted:** Total bundles created, total hosts included, total hosts excluded (no icon), average bundle size, total S3 storage used.
+
+### Stage 7: Frontend Build
+
+**Tool:** Script/template that produces `index.html`
+
+**Process:**
+1. Inject `TOTAL_BUNDLES` constant into the JS (baked at build time)
+2. Minify if desired
+3. Upload `index.html` to S3 `everytab-site` root
+
+### Stage 8: CloudFront Invalidation
+
+Invalidate `/*` on the CloudFront distribution so the new site is live.
+
+### Stage 9: Backup & Teardown
+
+**Process:**
+1.
Dump RDS database to local machine (homelab) — `pg_dump` over SSH tunnel or direct
+2. Sync S3 `everytab-icons` to homelab storage — `aws s3 sync`
+3. Confirm backups are complete
+4. Delete RDS instance
+5. Delete S3 `everytab-icons` bucket
+6. Terminate EC2 instance
+
+## Frontend Architecture
+
+### Single-File Design
+
+One `index.html` containing inline CSS and JS. No external dependencies, no framework. Two HTTP requests per initial page load:
+1. `GET /index.html` (HTML + CSS + JS, likely <50KB)
+2. `GET /tabs/{random}.json` (~1MB, one bundle of ~500-700 tabs)
+
+### Tab Rendering
+
+- Tabs fill the viewport in rows, styled to mimic Firefox browser tabs (v1)
+- Each row has a slight horizontal marquee animation (CSS) at varying speeds
+- Tab density adapts to viewport width (responsive)
+- Each tab shows: favicon (or blank for no-icon) + truncated title
+
+### Interaction
+
+- **Click tab (iframe_ok=true):** Opens an iframe overlay showing the actual site
+- **Click tab (iframe_ok=false):** Opens site in a new tab (with external link indicator)
+- **Close:** X button or click-away dismisses the iframe/overlay
+- **Scroll down:** Triggers fetch of additional random bundles (infinite scroll)
+
+### Randomization
+
+- Seed: current UTC date (so everyone on the same day sees the same "shuffle", but it changes daily)
+- Generate random bundle index in range [0, TOTAL_BUNDLES)
+- Track fetched bundle IDs in a Set to avoid duplicates on scroll
+
+### No-Icon Hosts
+
+Hosts without a favicon are included in bundles with `"icon": null`. Frontend renders these Firefox-style: just the title text with no icon. This matches Firefox's behavior for tabs without favicons.
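
The date-seeded selection described under Randomization can be sketched as follows. This is a Python illustration of the logic only; the real frontend would be inline JS and would need its own seedable PRNG, since `Math.random` cannot be seeded. The function name `bundle_order` is hypothetical, and using the ISO date string as the seed is an assumption.

```python
import random
from datetime import date, datetime, timezone
from typing import Iterator, Optional


def bundle_order(total_bundles: int, today: Optional[date] = None) -> Iterator[int]:
    """Yield each bundle index once, in an order fixed by the UTC date."""
    if today is None:
        today = datetime.now(timezone.utc).date()
    # Same seed for every visitor on the same UTC day (seed format is an assumption).
    rng = random.Random(today.isoformat())
    fetched = set()  # mirrors the Set of already-fetched bundle IDs
    while len(fetched) < total_bundles:
        index = rng.randrange(total_bundles)  # index in [0, total_bundles)
        if index not in fetched:
            fetched.add(index)
            yield index
```

The first value yielded corresponds to the initial page-load bundle, and each later value to a scroll-triggered fetch. Rejection sampling is cheap here because a visitor only fetches a handful of the available bundles.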
+
+## Cost Estimate
+
+### Scanning Phase (One-Time per Crawl)
+
+| Item | Estimate |
+|------|----------|
+| EC2 c5.xlarge (~24-48hrs) | $8-16 |
+| RDS db.t3.medium (~48hrs) | $3-5 |
+| S3 icons storage (temporary, ~500GB) | $12 (prorated to days) |
+| S3 GET requests (30M WARC reads) | $12 |
+| Data transfer (icon downloads, ~500GB inbound) | $0 (inbound is free) |
+| **Total** | **~$35-45** |
+
+### Hosting Phase (Monthly Steady-State)
+
+| Item | Estimate |
+|------|----------|
+| S3 storage (~60GB bundles) | $1.40 |
+| CloudFront (free tier: 1TB/month, 10M requests) | $0* |
+| S3 requests (via CloudFront origin pulls, cached) | ~$1-5 |
+| **Total** | **~$3-10/month** |
+
+*CloudFront free tier covers moderate traffic. Costs increase if the site goes viral, but that's a good problem to have.
+
+## Scaling Strategy
+
+### Development (100K domains)
+- Single EC2 instance
+- All stages complete in minutes-to-hours
+- Good for validating the full pipeline end-to-end
+
+### Full Scan (30M domains)
+- Single EC2 instance, high concurrency
+- CC-Index query: <1hr
+- WARC parsing: 2-6hrs (limited by S3 request rate)
+- Icon download: 12-48hrs (limited by network + remote server response times)
+- Bundle generation: 1-2hrs
+
+### Fleet Scaling (if needed)
+- Spin up N identical EC2 instances running the icon downloader
+- All share the same RDS instance
+- Work claiming via Postgres atomic UPDATEs (no coordinator needed)
+- Linear scaling: 4 instances = ~4x throughput
+
+## Key Design Decisions
+
+1. **Static-only hosting** — No servers running for the live site. Entire frontend is pre-built.
+2. **Inline icons in bundles** — No per-icon requests. One bundle fetch gives you ~600 tabs to render.
+3. **Unbound as system resolver** — Transparent to application code. Go HTTP client works normally; DNS just happens to resolve locally.
+4. **Content-addressed icon storage** — S3 key is the content hash.
Natural dedup at storage layer during scanning (but icons are duplicated across bundles for simplicity).
+5. **Resumable pipeline** — Each stage uses database state (parsed, scan_state) to track progress. Crash and restart without re-doing completed work.
+6. **PNG as universal icon format** — All icons converted to PNG for bundles regardless of source format. Smallest file size for small raster images, universally supported in browsers via data URIs.
+7. **Date-seeded randomization** — Everyone visiting on the same day sees the same tab arrangement, creating a shared experience. Changes daily for freshness.
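
To make decision 4 concrete, here is a minimal Python sketch of content-addressed keys. The document only says the S3 key is "the content hash"; the choice of SHA-256 and the function name `icon_s3_key` are assumptions made for illustration.

```python
import hashlib


def icon_s3_key(icon_bytes: bytes) -> str:
    """Derive the S3 key from the icon's raw bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(icon_bytes).hexdigest()


# Identical bytes map to the same key, so a repeated upload simply
# overwrites the same object instead of creating a duplicate.
store = {}
for payload in (b"icon-A", b"icon-B", b"icon-A"):
    store[icon_s3_key(payload)] = payload
```

After the loop, `store` holds two objects for three uploads, which is the storage-layer dedup the decision describes.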