everytab/ARCHITECTURE.md

# EveryTab Architecture

## System Overview

EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:

1. **Scanning Phase** — A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
2. **Hosting Phase** — A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.

The scanning phase runs monthly (triggered by new Common Crawl releases), produces a static site, and then its infrastructure is torn down. The hosting phase runs indefinitely at minimal cost.

```
Common Crawl (S3)
       |
       v
[EC2 + DuckDB] ---> [RDS Postgres] ---> [EC2 + Go programs] ---> S3 (icons/)
       |                    |                     |                     |
       |              (hosts, icons              |                     |
       |               tables)                   v                     v
       |                    |            [Bundle Generator] ---> S3 (tabs/*.json)
       |                    |                                          |
       |                    v                                          v
       |             [Backup to homelab]                    S3 (index.html)
       |                                                          |
       v                                                          v
  [Tear down EC2, RDS]                                     [CloudFront CDN]
```

## AWS Infrastructure

All resources in **us-east-1**.

| Resource | Purpose | Lifecycle |
|----------|---------|-----------|
| EC2 (xlarge, compute-optimized) | Run pipeline stages | Scanning only |
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup then delete) |
| S3 `everytab-icons` | Raw downloaded favicons | Scanning only (backup then delete) |
| S3 `everytab-site` | Static site: index.html + tabs/*.json | Permanent |
| CloudFront | CDN for static site | Permanent |
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only |

### Steady-State (Hosting Only)
- S3 `everytab-site` — stores index.html + ~50K JSON bundle files (~60GB total)
- CloudFront distribution — serves the site with caching

### Scanning Phase (Temporary)
- EC2 instance — runs all processing (no persistent local storage needed beyond OS)
- RDS — structured data store during pipeline execution
- S3 `everytab-icons` — temporary storage for downloaded favicons

## Data Model

### `hosts` table

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PRIMARY KEY | Internal ID |
| hostname | TEXT NOT NULL | e.g., `example.com` |
| protocol | TEXT NOT NULL | `https` or `http` (prefer https) |
| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., `CC-MAIN-2026-05`) |
| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
| warc_record_length | INT NOT NULL | Length of WARC record |
| html_title | TEXT | Extracted from `<title>` tag |
| iframe_allowed | BOOLEAN | True if site allows framing (no X-Frame-Options/CSP restriction) |
| best_icon_id | INT REFERENCES icons(id) | FK to the chosen icon for bundling |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |

**Constraints:** UNIQUE(hostname), so there is one row per hostname. When a host was crawled over both protocols, the https row wins.

### `icons` table

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PRIMARY KEY | Internal ID |
| host_id | INT REFERENCES hosts(id) | FK to parent host |
| url | TEXT NOT NULL | Full URL to the icon |
| source | TEXT NOT NULL | `favicon_ico` or `link_rel` |
| content_type | TEXT | MIME type after download (image/png, image/x-icon, etc.) |
| width | INT | Decoded pixel width |
| height | INT | Decoded pixel height |
| s3_key | TEXT | Key in everytab-icons bucket |
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
| error | TEXT | Error message if failed |

**Indexes:**
- `idx_icons_scan_state` on (scan_state) — for batch claiming work
- `idx_icons_host_id` on (host_id) — for best-icon selection

### Bundle JSON format (`tabs/0001.json`)

```json
{
  "entries": [
    {
      "host": "example.com",
      "title": "Example Domain",
      "icon": "iVBORw0KGgo...",
      "icon_w": 32,
      "icon_h": 32,
      "iframe_ok": true
    }
  ]
}
```

Icons are stored inline as base64-encoded PNG. Each bundle targets ~1MB, yielding approximately 500-700 entries per bundle depending on icon sizes.
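
As a rough illustration of the format above, the bundle could be modeled in Go like this. The type and function names (`BundleEntry`, `Bundle`, `encodeBundle`) are hypothetical, not taken from the actual codebase; the pointer type for `Icon` is one way to get `"icon": null` for hosts without a favicon.

```go
package main

import "encoding/json"

// BundleEntry mirrors one element of "entries" in a tabs/*.json file.
type BundleEntry struct {
	Host     string  `json:"host"`
	Title    string  `json:"title"`
	Icon     *string `json:"icon"` // base64-encoded PNG; nil marshals to null
	IconW    int     `json:"icon_w"`
	IconH    int     `json:"icon_h"`
	IframeOK bool    `json:"iframe_ok"`
}

// Bundle is the top-level object of a bundle file.
type Bundle struct {
	Entries []BundleEntry `json:"entries"`
}

// encodeBundle serializes a bundle to compact JSON.
func encodeBundle(b Bundle) ([]byte, error) {
	return json.Marshal(b)
}
```

Compact output (no indentation) keeps each bundle closer to the ~1MB target.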

## Pipeline Stages

The pipeline is a series of manually run scripts executed in order. Each stage is idempotent and resumable.

### Stage 1: CC-Index Query

**Tool:** DuckDB with httpfs extension (or local parquet if httpfs takes >1hr)

**Input:** Common Crawl columnar index (parquet files on CC's S3)

**Query logic:**
```sql
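-- `ccindex` is a placeholder for a view over the crawl's parquet files.
SELECT url_host_name, url_protocol, warc_filename,
       warc_record_offset, warc_record_length
FROM ccindex
WHERE url_path = '/'
  AND content_mime_type = 'text/html'
  AND fetch_status = 200
  AND url_query IS NULL
  AND url_protocol IN ('http', 'https')
  -- Assumption: url_port is NULL when the URL has no explicit port, so
  -- default-port rows must be admitted explicitly.
  AND (url_port IS NULL OR url_port IN (80, 443))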
```

**Deduplication:** Per hostname, prefer `https` over `http`. Result is one row per unique hostname.

**Output:** Populates `hosts` table in RDS (~30M rows for a full crawl).

**Stats emitted:** Total domains found, https vs http breakdown, duplicates removed.

### Stage 2: WARC Parsing

**Tool:** Custom Go program, highly concurrent

**Input:** `hosts` table rows where `parsed = FALSE`

**Process:**
1. Claim a batch of rows (set parsed = TRUE optimistically, or use a cursor)
2. For each row, make a byte-range GET request to Common Crawl's S3:
   - `Range: bytes={offset}-{offset+length-1}`
   - Target: `s3://commoncrawl/{warc_filename}`
3. Parse the WARC record to extract the HTTP response
4. Parse HTML (defensively — handle malformed HTML, use a lenient parser):
   - Extract `<title>` tag content
   - Extract `<link rel="icon">` href values (filter to png/gif/ico, sizes 16-64px)
   - Check HTTP response headers for `X-Frame-Options` and CSP `frame-ancestors`
5. Insert a `/favicon.ico` entry into `icons` for every host (always attempt this)
6. Insert any qualifying `link rel="icon"` entries into `icons`
7. Update `hosts` row with `html_title`, `iframe_allowed`, `parsed = TRUE`
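
The byte-range fetch in step 2 might be prepared as follows. This is a minimal sketch: the function names are hypothetical, and it assumes the record is fetched over Common Crawl's public HTTPS endpoint (`https://data.commoncrawl.org/`) rather than through the S3 API.

```go
package main

import (
	"fmt"
	"net/http"
)

// warcBaseURL serves the same objects that live under s3://commoncrawl/.
// Using HTTPS instead of the S3 API is an assumption of this sketch.
const warcBaseURL = "https://data.commoncrawl.org/"

// rangeHeader builds the inclusive byte range covering one WARC record.
func rangeHeader(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// newWARCRequest prepares (but does not send) the ranged GET for one hosts row.
func newWARCRequest(warcFilename string, offset, length int64) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, warcBaseURL+warcFilename, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rangeHeader(offset, length))
	return req, nil
}
```

Common Crawl compresses its WARC files record by record, so the bytes returned for one range should be a standalone gzip member that can be passed to `compress/gzip` before WARC parsing.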

**Concurrency:** High — thousands of goroutines. S3 byte-range requests are the bottleneck; S3 handles 5,500+ GET/s per prefix and WARC files are spread across many prefixes.

**Error handling:** If HTML is unparseable, mark as parsed with NULL title. If WARC fetch fails, retry once then skip. Log all errors with hostname for investigation.

**Stats emitted:** Rows processed, titles extracted, icons found (by type), iframe restrictions found, parse failures.

### Stage 3: DNS Resolution Setup

**Tool:** Unbound, installed and configured on EC2

**Configuration:**
- Recursive resolver (no forwarding to upstream)
- Listening on 127.0.0.1:53
- Aggressive caching enabled
- High min-TTL (e.g., 3600s) to maximize cache hits across similar domains
- Configured as system resolver in `/etc/resolv.conf`
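
A configuration along these lines would match the bullets above. It is an untested sketch for `unbound.conf`; the thread count and cache sizes are placeholder values to tune against the instance's CPU and memory, not measured settings.

```
server:
    # Listen only on loopback; the pipeline runs on the same host.
    interface: 127.0.0.1
    port: 53
    access-control: 127.0.0.0/8 allow

    # Floor on cached TTLs (seconds) to keep records cached longer.
    cache-min-ttl: 3600
    # Refresh popular entries before they expire.
    prefetch: yes

    # Placeholder sizing: adjust to the instance.
    num-threads: 4
    msg-cache-size: 256m
    rrset-cache-size: 512m

# No forward-zone is defined, so Unbound recurses from the root servers.
```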

This runs as a background service. No separate "DNS resolution stage" — the Go icon downloader's HTTP requests transparently use Unbound via the OS resolver. Unbound handles recursive resolution and caching.

**Why:** Downloading 30M+ icons without a local recursive resolver would overwhelm upstream DNS providers and likely get us rate-limited. Unbound resolves from root servers directly, caches aggressively, and handles the load locally.

### Stage 4: Icon Download

**Tool:** Custom Go program, highly concurrent

**Input:** `icons` table rows where `scan_state = 'unscanned'`

**Process:**
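1. Claim a batch of rows atomically: `UPDATE icons SET scan_state = 'in_progress' WHERE id IN (SELECT id FROM icons WHERE scan_state = 'unscanned' LIMIT N FOR UPDATE SKIP LOCKED) RETURNING *`. Postgres has no `UPDATE ... LIMIT`, so the limit goes in the subquery.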
2. For each icon URL:
   - Make HTTP(S) GET request (normal Go HTTP client, DNS goes through Unbound)
   - Enforce timeout (5s connect, 10s total)
   - Enforce max download size (512KB — generous for icons)
   - On success: validate it's an image (check magic bytes), decode to get dimensions
   - Upload raw bytes to S3 `everytab-icons/{hash}` (content-addressed)
   - Update `icons` row: s3_key, content_type, width, height, scan_state = 'completed'
   - On failure: scan_state = 'failed', error = reason

**Concurrency:** Maximize throughput — goroutine pool with configurable size (start at 1000, tune based on memory/bandwidth). Use semaphore pattern for backpressure.

**Fast failure:** DNS errors, refused connections, and timeouts are all terminal on the first attempt (no retries for icons: if a host is down, it's down). This keeps the long tail short.

**Scaling to fleet:** If a single instance is insufficient:
- Multiple EC2 instances run the same binary
- Each claims work via the `scan_state` UPDATE (Postgres row-level locking prevents double-work)
- No coordination needed beyond the shared database

**Stats emitted:** Icons attempted, completed, failed (by error type: DNS, timeout, HTTP error, invalid image, too large), download rate (icons/sec), bytes downloaded.

### Stage 5: Best Icon Selection

**Tool:** SQL query or small script

**Process:**
For each host, select the best icon from its completed icons:
1. Filter to standard sizes: 16x16, 32x32, 48x48, 64x64
2. Among those, pick the largest dimensions (prefer 64 > 48 > 32 > 16)
3. If no standard sizes found, pick the largest icon with dimensions <= 64px on both axes
4. If no icons at all, host gets a NULL best_icon_id (will use default in frontend)

```sql
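UPDATE hosts h SET best_icon_id = (
  SELECT i.id FROM icons i
  WHERE i.host_id = h.id
    AND i.scan_state = 'completed'
    -- Rule 3: never pick an icon larger than 64px on either axis.
    AND i.width <= 64 AND i.height <= 64
  ORDER BY
    -- Rules 1-2: standard square sizes (16, 32, 48, 64) come first...
    (i.width = i.height AND i.width IN (16, 32, 48, 64)) DESC,
    -- ...then the largest icon wins; id breaks ties deterministically.
    i.width * i.height DESC,
    i.id
  LIMIT 1
);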
```

**Stats emitted:** Hosts with icons, hosts without icons, icon size distribution.

### Stage 6: Bundle Generation

**Tool:** Custom Go program

**Input:** `hosts` table (joined with their best icon from S3)

**Process:**
1. Query all hosts, including those where best_icon_id IS NULL (they are emitted with `"icon": null` and skip the icon processing in step 3)
2. Randomize the full result set (ORDER BY random() or shuffle in memory)
3. For each host:
   - Download its best icon from S3 `everytab-icons`
   - Decode the icon (ICO/GIF/PNG/etc.)
   - For ICO files: extract the largest embedded image at a standard size <= 64x64
   - Re-encode as PNG (optimized compression)
   - Base64-encode the PNG bytes
4. Chunk into groups of N entries (~500-700, tuned so each JSON is ~1MB)
5. Write each chunk as `tabs/{n}.json` to S3 `everytab-site`
6. Record total bundle count

**Output:**
- `tabs/0000.json` through `tabs/{M-1}.json` in S3
- Total bundle count M (used in frontend build)
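
Chunking and key naming might be sketched as below. Instead of a fixed N, this variant closes a bundle once its serialized entries would exceed a byte budget, which is one way to hit the ~1MB target when icon sizes vary. `Entry`, `chunkBySize`, and `bundleKey` are hypothetical names, and `Entry` carries only a subset of the bundle fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entry is a trimmed-down bundle entry for illustration.
type Entry struct {
	Host  string  `json:"host"`
	Title string  `json:"title"`
	Icon  *string `json:"icon"`
}

// chunkBySize groups entries so each group's serialized size stays at or
// under maxBytes. A single oversized entry still gets a group of its own.
func chunkBySize(entries []Entry, maxBytes int) ([][]Entry, error) {
	var chunks [][]Entry
	var cur []Entry
	size := 0
	for _, e := range entries {
		b, err := json.Marshal(e)
		if err != nil {
			return nil, err
		}
		n := len(b) + 1 // +1 for the separating comma
		if len(cur) > 0 && size+n > maxBytes {
			chunks = append(chunks, cur)
			cur, size = nil, 0
		}
		cur = append(cur, e)
		size += n
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks, nil
}

// bundleKey names bundle n, zero-padded to at least four digits.
func bundleKey(n int) string {
	return fmt.Sprintf("tabs/%04d.json", n)
}
```

With roughly 50K bundles, indexes above 9999 simply widen to five digits (`tabs/12345.json`); the frontend has to build keys with the same padding rule.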

**Stats emitted:** Total bundles created, total hosts included, hosts included without an icon, average bundle size, total S3 storage used.

### Stage 7: Frontend Build

**Tool:** Script/template that produces `index.html`

**Process:**
1. Inject `TOTAL_BUNDLES` constant into the JS (baked at build time)
2. Minify if desired
3. Upload `index.html` to S3 `everytab-site` root

### Stage 8: CloudFront Invalidation

Invalidate `/*` on the CloudFront distribution so the new site is live.

### Stage 9: Backup & Teardown

**Process:**
1. Dump RDS database to local machine (homelab) — `pg_dump` over SSH tunnel or direct
2. Sync S3 `everytab-icons` to homelab storage — `aws s3 sync`
3. Confirm backups are complete
4. Delete RDS instance
5. Delete S3 `everytab-icons` bucket
6. Terminate EC2 instance

## Frontend Architecture

### Single-File Design

One `index.html` containing inline CSS and JS. No external dependencies, no framework. Two HTTP requests per initial page load:
1. `GET /index.html` (HTML + CSS + JS, likely <50KB)
2. `GET /tabs/{random}.json` (~1MB, one bundle of ~500-700 tabs)

### Tab Rendering

- Tabs fill the viewport in rows, styled to mimic Firefox browser tabs (v1)
- Each row has a slight horizontal marquee animation (CSS) at varying speeds
- Tab density adapts to viewport width (responsive)
- Each tab shows: favicon (or blank for no-icon) + truncated title

### Interaction

- **Click tab (iframe_ok=true):** Opens an iframe overlay showing the actual site
- **Click tab (iframe_ok=false):** Opens site in a new tab (with external link indicator)
- **Close:** X button or click-away dismisses the iframe/overlay
- **Scroll down:** Triggers fetch of additional random bundles (infinite scroll)

### Randomization

- Seed: current UTC date (so everyone on the same day sees the same "shuffle", but it changes daily)
- Generate random bundle index in range [0, TOTAL_BUNDLES)
- Track fetched bundle IDs in a Set to avoid duplicates on scroll

### No-Icon Hosts

Hosts without a favicon are included in bundles with `"icon": null`. Frontend renders these Firefox-style: just the title text with no icon. This matches Firefox's behavior for tabs without favicons.

## Cost Estimate

### Scanning Phase (One-Time per Crawl)

| Item | Estimate |
|------|----------|
| EC2 c5.xlarge (~24-48hrs) | $8-16 |
| RDS db.t3.medium (~48hrs) | $3-5 |
| S3 icons storage (temporary, ~500GB) | $12 (prorated to days) |
| S3 GET requests (30M WARC reads) | $12 |
| Data transfer (icon downloads, ~500GB inbound) | $0 (inbound is free) |
| **Total** | **~$35-45** |

### Hosting Phase (Monthly Steady-State)

| Item | Estimate |
|------|----------|
| S3 storage (~60GB bundles) | $1.40 |
| CloudFront (free tier: 1TB/month, 10M requests) | $0* |
| S3 requests (via CloudFront origin pulls, cached) | ~$1-5 |
| **Total** | **~$3-10/month** |

*CloudFront free tier covers moderate traffic. Costs increase if the site goes viral, but that's a good problem to have.

## Scaling Strategy

### Development (100K domains)
- Single EC2 instance
- All stages complete in minutes-to-hours
- Good for validating the full pipeline end-to-end

### Full Scan (30M domains)
- Single EC2 instance, high concurrency
- CC-Index query: <1hr
- WARC parsing: 2-6hrs (limited by S3 request rate)
- Icon download: 12-48hrs (limited by network + remote server response times)
- Bundle generation: 1-2hrs

### Fleet Scaling (if needed)
- Spin up N identical EC2 instances running the icon downloader
- All share the same RDS instance
- Work claiming via Postgres atomic UPDATEs (no coordinator needed)
- Near-linear scaling expected: 4 instances = ~4x throughput, as long as the shared RDS instance keeps up with the claim and update traffic

## Key Design Decisions

1. **Static-only hosting** — No servers running for the live site. Entire frontend is pre-built.
2. **Inline icons in bundles** — No per-icon requests. One bundle fetch gives you ~600 tabs to render.
3. **Unbound as system resolver** — Transparent to application code. Go HTTP client works normally; DNS just happens to resolve locally.
4. **Content-addressed icon storage** — S3 key is the content hash. Natural dedup at storage layer during scanning (but icons are duplicated across bundles for simplicity).
5. **Resumable pipeline** — Each stage uses database state (parsed, scan_state) to track progress. Crash and restart without re-doing completed work.
6. **PNG as universal icon format** — All icons converted to PNG for bundles regardless of source format. PNG is lossless, compact for small raster images, and universally supported in browsers via data URIs.
7. **Date-seeded randomization** — Everyone visiting on the same day sees the same tab arrangement, creating a shared experience. Changes daily for freshness.