diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 0000000..4eeb3ac
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,360 @@
+# EveryTab Architecture
+
+## System Overview
+
+EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:
+
+1. **Scanning Phase** — A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
+2. **Hosting Phase** — A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.
+
+The scanning phase runs monthly (triggered by new Common Crawl releases), produces a static site, and then its infrastructure is torn down. The hosting phase runs indefinitely at minimal cost.
+
+```
+Common Crawl (S3)
+       |
+       v
+[EC2 + DuckDB] ---> [RDS Postgres] ---> [EC2 + Go programs] ---> S3 (icons/)
+       |                    |                     |                     |
+       |              (hosts, icons              |                     |
+       |               tables)                   v                     v
+       |                    |            [Bundle Generator] ---> S3 (tabs/*.json)
+       |                    |                                          |
+       |                    v                                          v
+       |             [Backup to homelab]                    S3 (index.html)
+       |                                                          |
+       v                                                          v
+  [Tear down EC2, RDS]                                     [CloudFront CDN]
+```
+
+## AWS Infrastructure
+
+All resources in **us-east-1**.
+
+| Resource | Purpose | Lifecycle |
+|----------|---------|-----------|
+| EC2 (xlarge, compute-optimized) | Run pipeline stages | Scanning only |
+| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup then delete) |
+| S3 `everytab-icons` | Raw downloaded favicons | Scanning only (backup then delete) |
+| S3 `everytab-site` | Static site: index.html + tabs/*.json | Permanent |
+| CloudFront | CDN for static site | Permanent |
+| Unbound (on EC2) | Local recursive DNS resolver | Scanning only |
+
+### Steady-State (Hosting Only)
+- S3 `everytab-site` — stores index.html + ~50K JSON bundle files (~60GB total)
+- CloudFront distribution — serves the site with caching
+
+### Scanning Phase (Temporary)
+- EC2 instance — runs all processing (no persistent local storage needed beyond OS)
+- RDS — structured data store during pipeline execution
+- S3 `everytab-icons` — temporary storage for downloaded favicons
+
+## Data Model
+
+### `hosts` table
+
+| Column | Type | Description |
+|--------|------|-------------|
+| id | SERIAL PRIMARY KEY | Internal ID |
+| hostname | TEXT NOT NULL | e.g., `example.com` |
+| protocol | TEXT NOT NULL | `https` or `http` (prefer https) |
+| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., `CC-MAIN-2026-05`) |
+| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
+| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
+| warc_record_length | INT NOT NULL | Length of WARC record |
+| html_title | TEXT | Extracted from `<title>` tag |
+| iframe_allowed | BOOLEAN | True if site allows framing (no X-Frame-Options/CSP restriction) |
+| best_icon_id | INT REFERENCES icons(id) | FK to the chosen icon for bundling |
+| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |
+
+**Constraints:** UNIQUE(hostname) — one row per domain, prefer https over http.
+
+### `icons` table
+
+| Column | Type | Description |
+|--------|------|-------------|
+| id | SERIAL PRIMARY KEY | Internal ID |
+| host_id | INT REFERENCES hosts(id) | FK to parent host |
+| url | TEXT NOT NULL | Full URL to the icon |
+| source | TEXT NOT NULL | `favicon_ico` or `link_rel` |
+| content_type | TEXT | MIME type after download (image/png, image/x-icon, etc.) |
+| width | INT | Decoded pixel width |
+| height | INT | Decoded pixel height |
+| s3_key | TEXT | Key in everytab-icons bucket |
+| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
+| error | TEXT | Error message if failed |
+
+**Indexes:**
+- `idx_icons_scan_state` on (scan_state) — for batch claiming work
+- `idx_icons_host_id` on (host_id) — for best-icon selection
+
+### Bundle JSON format (`tabs/0001.json`)
+
+```json
+{
+  "entries": [
+    {
+      "host": "example.com",
+      "title": "Example Domain",
+      "icon": "iVBORw0KGgo...",
+      "icon_w": 32,
+      "icon_h": 32,
+      "iframe_ok": true
+    }
+  ]
+}
+```
+
+Icons are stored inline as base64-encoded PNG. Each bundle targets ~1MB, yielding approximately 500-700 entries per bundle depending on icon sizes.
+
+## Pipeline Stages
+
+The pipeline is a series of manually-run scripts executed in order. Each stage is idempotent and resumable.
+
+### Stage 1: CC-Index Query
+
+**Tool:** DuckDB with httpfs extension (or local parquet if httpfs takes >1hr)
+
+**Input:** Common Crawl columnar index (parquet files on CC's S3)
+
+**Query logic:**
+```sql
+WHERE url_path = '/'
+  AND content_mime_type = 'text/html'
+  AND fetch_status = 200
+  AND url_query IS NULL
+  AND url_protocol IN ('http', 'https')
+  AND url_port IN (80, 443)
+```
+
+**Deduplication:** Per hostname, prefer `https` over `http`. Result is one row per unique hostname.
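+
+The deduplication rule can be sketched in Go as follows. This is a minimal illustration, not the pipeline's actual code: `HostRow` and `DedupeHosts` are hypothetical names, and in practice the same rule could equally be expressed inside the DuckDB query.
+
+```go
+package main
+
+// HostRow is a hypothetical, trimmed-down view of one CC-Index result row.
+type HostRow struct {
+	Hostname string
+	Protocol string // "http" or "https"
+}
+
+// DedupeHosts keeps one row per hostname. An http row is replaced
+// when an https row for the same hostname shows up later.
+func DedupeHosts(rows []HostRow) map[string]HostRow {
+	best := make(map[string]HostRow, len(rows))
+	for _, r := range rows {
+		cur, seen := best[r.Hostname]
+		if !seen || (cur.Protocol == "http" && r.Protocol == "https") {
+			best[r.Hostname] = r
+		}
+	}
+	return best
+}
+```
+
+The length of the returned map is the unique-domain count, and the difference from `len(rows)` is the "duplicates removed" stat.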
+
+**Output:** Populates `hosts` table in RDS (~30M rows for a full crawl).
+
+**Stats emitted:** Total domains found, https vs http breakdown, duplicates removed.
+
+### Stage 2: WARC Parsing
+
+**Tool:** Custom Go program, highly concurrent
+
+**Input:** `hosts` table rows where `parsed = FALSE`
+
+**Process:**
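+1. Claim a batch of rows (either set `parsed = TRUE` optimistically, accepting that a crash mid-batch leaves those rows unparsed, or page through with a cursor and set the flag only in step 7)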
+2. For each row, make a byte-range GET request to Common Crawl's S3:
+   - `Range: bytes={offset}-{offset+length-1}`
+   - Target: `s3://commoncrawl/{warc_filename}`
+3. Parse the WARC record to extract the HTTP response
+4. Parse HTML (defensively — handle malformed HTML, use a lenient parser):
+   - Extract `<title>` tag content
+   - Extract `<link rel="icon">` href values (filter to png/gif/ico, sizes 16-64px)
+   - Check HTTP response headers for `X-Frame-Options` and CSP `frame-ancestors`
+5. Insert a `/favicon.ico` entry into `icons` for every host (always attempt this)
+6. Insert any qualifying `link rel="icon"` entries into `icons`
+7. Update `hosts` row with `html_title`, `iframe_allowed`, `parsed = TRUE`
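+
+Two small pieces of the steps above, sketched in Go under stated assumptions: the inclusive `Range` header for one WARC record, and the `iframe_allowed` check. The function names are hypothetical. The header check is deliberately conservative: it treats any `X-Frame-Options` value and any CSP `frame-ancestors` directive as a restriction, which will misclassify permissive policies such as `frame-ancestors *`.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/http"
+	"strings"
+)
+
+// RangeHeader builds the inclusive byte range for one WARC record.
+func RangeHeader(offset, length int64) string {
+	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
+}
+
+// IframeAllowed reports whether the archived response headers leave
+// framing unrestricted. Assumption: any X-Frame-Options value or any
+// frame-ancestors directive counts as a restriction.
+func IframeAllowed(h http.Header) bool {
+	if h.Get("X-Frame-Options") != "" {
+		return false
+	}
+	for _, policy := range h.Values("Content-Security-Policy") {
+		if strings.Contains(strings.ToLower(policy), "frame-ancestors") {
+			return false
+		}
+	}
+	return true
+}
+```
+
+For example, a record at offset 100 with length 50 is requested as `bytes=100-149`.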
+
+**Concurrency:** High — thousands of goroutines. S3 byte-range requests are the bottleneck; S3 handles 5,500+ GET/s per prefix and WARC files are spread across many prefixes.
+
+**Error handling:** If HTML is unparseable, mark as parsed with NULL title. If WARC fetch fails, retry once then skip. Log all errors with hostname for investigation.
+
+**Stats emitted:** Rows processed, titles extracted, icons found (by type), iframe restrictions found, parse failures.
+
+### Stage 3: DNS Resolution Setup
+
+**Tool:** Unbound, installed and configured on EC2
+
+**Configuration:**
+- Recursive resolver (no forwarding to upstream)
+- Listening on 127.0.0.1:53
+- Aggressive caching enabled
+- High min-TTL (e.g., 3600s) to maximize cache hits across similar domains
+- Configured as system resolver in `/etc/resolv.conf`
+
+This runs as a background service. No separate "DNS resolution stage" — the Go icon downloader's HTTP requests transparently use Unbound via the OS resolver. Unbound handles recursive resolution and caching.
+
+**Why:** Downloading 30M+ icons without a local recursive resolver would overwhelm upstream DNS providers and likely get us rate-limited. Unbound resolves from root servers directly, caches aggressively, and handles the load locally.
+
+### Stage 4: Icon Download
+
+**Tool:** Custom Go program, highly concurrent
+
+**Input:** `icons` table rows where `scan_state = 'unscanned'`
+
+**Process:**
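+1. Claim a batch of rows atomically. Postgres `UPDATE` has no `LIMIT` clause, so the batch is selected in a subquery: `UPDATE icons SET scan_state = 'in_progress' WHERE id IN (SELECT id FROM icons WHERE scan_state = 'unscanned' LIMIT N FOR UPDATE SKIP LOCKED) RETURNING *`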
+2. For each icon URL:
+   - Make HTTP(S) GET request (normal Go HTTP client, DNS goes through Unbound)
+   - Enforce timeout (5s connect, 10s total)
+   - Enforce max download size (512KB — generous for icons)
+   - On success: validate it's an image (check magic bytes), decode to get dimensions
+   - Upload raw bytes to S3 `everytab-icons/{hash}` (content-addressed)
+   - Update `icons` row: s3_key, content_type, width, height, scan_state = 'completed'
+   - On failure: scan_state = 'failed', error = reason
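+
+The size cap, magic-byte validation, and content-addressed key from the steps above might look like the sketch below. All names are hypothetical, and two details are assumptions rather than decisions from this document: the key is the hex SHA-256 of the raw bytes, and only PNG, GIF, and ICO signatures are recognized.
+
+```go
+package main
+
+import (
+	"bytes"
+	"crypto/sha256"
+	"encoding/hex"
+	"errors"
+	"io"
+)
+
+const maxIconBytes = 512 * 1024 // download cap from the step above
+
+// ErrTooLarge is returned when a response body exceeds maxIconBytes.
+var ErrTooLarge = errors.New("icon exceeds size limit")
+
+// ReadCapped reads one byte past the cap so oversized bodies are
+// detected without ever buffering more than maxIconBytes+1 bytes.
+func ReadCapped(r io.Reader) ([]byte, error) {
+	data, err := io.ReadAll(io.LimitReader(r, maxIconBytes+1))
+	if err != nil {
+		return nil, err
+	}
+	if len(data) > maxIconBytes {
+		return nil, ErrTooLarge
+	}
+	return data, nil
+}
+
+// SniffImageType returns a MIME type from the leading magic bytes,
+// or "" when the data is not a recognized image.
+func SniffImageType(data []byte) string {
+	switch {
+	case bytes.HasPrefix(data, []byte("\x89PNG\r\n\x1a\n")):
+		return "image/png"
+	case bytes.HasPrefix(data, []byte("GIF87a")), bytes.HasPrefix(data, []byte("GIF89a")):
+		return "image/gif"
+	case bytes.HasPrefix(data, []byte{0x00, 0x00, 0x01, 0x00}):
+		return "image/x-icon"
+	}
+	return ""
+}
+
+// ContentKey derives the S3 key from the content (hex SHA-256 is an assumption).
+func ContentKey(data []byte) string {
+	sum := sha256.Sum256(data)
+	return hex.EncodeToString(sum[:])
+}
+```
+
+An empty result from `SniffImageType` maps to the "invalid image" failure type, and `ErrTooLarge` maps to "too large".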
+
+**Concurrency:** Maximize throughput — goroutine pool with configurable size (start at 1000, tune based on memory/bandwidth). Use semaphore pattern for backpressure.
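+
+A minimal sketch of the semaphore pattern, assuming a buffered channel is used as the semaphore. `RunBounded` and its parameters are hypothetical names; `limit` corresponds to the configurable pool size.
+
+```go
+package main
+
+import "sync"
+
+// RunBounded calls work(i) for every i in [0, n) with at most
+// limit calls in flight at any moment.
+func RunBounded(n, limit int, work func(i int)) {
+	sem := make(chan struct{}, limit)
+	var wg sync.WaitGroup
+	for i := 0; i < n; i++ {
+		sem <- struct{}{} // blocks while limit workers are busy (backpressure)
+		wg.Add(1)
+		go func(i int) {
+			defer wg.Done()
+			defer func() { <-sem }() // release the slot when done
+			work(i)
+		}(i)
+	}
+	wg.Wait()
+}
+```
+
+Because the producer blocks on the channel send, the number of live goroutines never exceeds `limit`, so memory use stays bounded regardless of batch size.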
+
+**Fast failure:** DNS errors, refused connections, and timeouts are recorded as `failed` on the first attempt (no retry for icons: if it's down, it's down). This keeps the long tail short.
+
+**Scaling to fleet:** If a single instance is insufficient:
+- Multiple EC2 instances run the same binary
+- Each claims work via the `scan_state` UPDATE (with `FOR UPDATE SKIP LOCKED` in the claiming query, Postgres row-level locking prevents double-work)
+- No coordination needed beyond the shared database
+
+**Stats emitted:** Icons attempted, completed, failed (by error type: DNS, timeout, HTTP error, invalid image, too large), download rate (icons/sec), bytes downloaded.
+
+### Stage 5: Best Icon Selection
+
+**Tool:** SQL query or small script
+
+**Process:**
+For each host, select the best icon from its completed icons:
+1. Filter to standard sizes: 16x16, 32x32, 48x48, 64x64
+2. Among those, pick the largest dimensions (prefer 64 > 48 > 32 > 16)
+3. If no standard sizes found, pick the largest icon with dimensions <= 64px on both axes
+4. If no icons at all, host gets a NULL best_icon_id (will use default in frontend)
+
+```sql
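+UPDATE hosts h SET best_icon_id = (
+  SELECT i.id FROM icons i
+  WHERE i.host_id = h.id AND i.scan_state = 'completed'
+    -- Never pick an icon larger than 64px on either axis (steps 1-3)
+    AND i.width <= 64 AND i.height <= 64
+  ORDER BY
+    -- Standard square sizes (16, 32, 48, 64) rank first
+    (i.width = i.height AND i.width IN (16, 32, 48, 64)) DESC,
+    -- Then the largest remaining icon; id breaks ties deterministically
+    i.width * i.height DESC,
+    i.id
+  LIMIT 1
+);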
+```
+
+**Stats emitted:** Hosts with icons, hosts without icons, icon size distribution.
+
+### Stage 6: Bundle Generation
+
+**Tool:** Custom Go program
+
+**Input:** `hosts` table (joined with their best icon from S3)
+
+**Process:**
+1. Query all hosts, including those with a NULL best_icon_id (these are emitted with `"icon": null`)
+2. Randomize the full result set (ORDER BY random() or shuffle in memory)
+3. For each host:
+   - Download its best icon from S3 `everytab-icons`
+   - Decode the icon (ICO/GIF/PNG/etc.)
+   - For ICO files: extract the largest embedded image at a standard size <= 64x64
+   - Re-encode as PNG (optimized compression)
+   - Base64-encode the PNG bytes
+4. Chunk into groups of N entries (~500-700, tuned so each JSON is ~1MB)
+5. Write each chunk as `tabs/{n}.json` to S3 `everytab-site`
+6. Record total bundle count
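+
+The chunking step can be sketched as below, assuming bundles are cut by accumulated JSON size rather than by a fixed entry count. `Entry`, `ChunkBySize`, and `BundleKey` are hypothetical names; the struct fields follow the bundle JSON format above, and the four-digit padding is taken from the `tabs/0000.json` example.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// Entry mirrors one element of "entries" in the bundle JSON format.
+type Entry struct {
+	Host     string  `json:"host"`
+	Title    string  `json:"title"`
+	Icon     *string `json:"icon"` // base64 PNG, nil for no-icon hosts
+	IconW    int     `json:"icon_w"`
+	IconH    int     `json:"icon_h"`
+	IframeOK bool    `json:"iframe_ok"`
+}
+
+// ChunkBySize groups entries so the summed JSON size of each group stays
+// at or under targetBytes. An entry larger than the target still gets its
+// own group. The small overhead of the wrapper object and commas is ignored.
+func ChunkBySize(entries []Entry, targetBytes int) ([][]Entry, error) {
+	var chunks [][]Entry
+	var cur []Entry
+	size := 0
+	for _, e := range entries {
+		b, err := json.Marshal(e)
+		if err != nil {
+			return nil, err
+		}
+		if len(cur) > 0 && size+len(b) > targetBytes {
+			chunks = append(chunks, cur)
+			cur, size = nil, 0
+		}
+		cur = append(cur, e)
+		size += len(b)
+	}
+	if len(cur) > 0 {
+		chunks = append(chunks, cur)
+	}
+	return chunks, nil
+}
+
+// BundleKey returns the S3 key for bundle n.
+func BundleKey(n int) string {
+	return fmt.Sprintf("tabs/%04d.json", n)
+}
+```
+
+Note that `%04d` only pads to four digits: bundle 12 becomes `tabs/0012.json`, while bundle numbers of 10000 and above simply render with five digits.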
+
+**Output:**
+- `tabs/0000.json` through `tabs/{M-1}.json` in S3
+- Total bundle count M (injected into the frontend as `TOTAL_BUNDLES`, so valid indices are [0, M))
+
+**Stats emitted:** Total bundles created, total hosts included, total hosts without an icon, average bundle size, total S3 storage used.
+
+### Stage 7: Frontend Build
+
+**Tool:** Script/template that produces `index.html`
+
+**Process:**
+1. Inject `TOTAL_BUNDLES` constant into the JS (baked at build time)
+2. Minify if desired
+3. Upload `index.html` to S3 `everytab-site` root
+
+### Stage 8: CloudFront Invalidation
+
+Invalidate `/*` on the CloudFront distribution so the new site is live.
+
+### Stage 9: Backup & Teardown
+
+**Process:**
+1. Dump RDS database to local machine (homelab) — `pg_dump` over SSH tunnel or direct
+2. Sync S3 `everytab-icons` to homelab storage — `aws s3 sync`
+3. Confirm backups are complete
+4. Delete RDS instance
+5. Delete S3 `everytab-icons` bucket
+6. Terminate EC2 instance
+
+## Frontend Architecture
+
+### Single-File Design
+
+One `index.html` containing inline CSS and JS. No external dependencies, no framework. Two HTTP requests per initial page load:
+1. `GET /index.html` (HTML + CSS + JS, likely <50KB)
+2. `GET /tabs/{random}.json` (~1MB, one bundle of ~500-700 tabs)
+
+### Tab Rendering
+
+- Tabs fill the viewport in rows, styled to mimic Firefox browser tabs (v1)
+- Each row has a slight horizontal marquee animation (CSS) at varying speeds
+- Tab density adapts to viewport width (responsive)
+- Each tab shows: favicon (or blank for no-icon) + truncated title
+
+### Interaction
+
+- **Click tab (iframe_ok=true):** Opens an iframe overlay showing the actual site
+- **Click tab (iframe_ok=false):** Opens site in a new tab (with external link indicator)
+- **Close:** X button or click-away dismisses the iframe/overlay
+- **Scroll down:** Triggers fetch of additional random bundles (infinite scroll)
+
+### Randomization
+
+- Seed: current UTC date (so everyone on the same day sees the same "shuffle", but it changes daily)
+- Generate random bundle index in range [0, TOTAL_BUNDLES)
+- Track fetched bundle IDs in a Set to avoid duplicates on scroll
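+
+The logic above can be illustrated with the following sketch. It is written in Go only to match the pipeline code; the real frontend would be inline JavaScript. It also swaps in a different technique from the one described: instead of drawing random indices and rejecting repeats with a Set, it derives a full date-seeded permutation, which avoids duplicates by construction. `BundleOrder` is a hypothetical name, and hashing the date string with FNV-1a is an assumption.
+
+```go
+package main
+
+import (
+	"hash/fnv"
+	"math/rand"
+)
+
+// BundleOrder returns every bundle index in [0, totalBundles) exactly once,
+// in an order determined only by the UTC date string (e.g. "2026-02-01").
+func BundleOrder(utcDate string, totalBundles int) []int {
+	h := fnv.New64a()
+	h.Write([]byte(utcDate)) // hash the date into a 64-bit seed
+	rng := rand.New(rand.NewSource(int64(h.Sum64())))
+	return rng.Perm(totalBundles)
+}
+```
+
+The first element is the bundle fetched on initial page load, and each scroll-triggered fetch takes the next element.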
+
+### No-Icon Hosts
+
+Hosts without a favicon are included in bundles with `"icon": null`. Frontend renders these Firefox-style: just the title text with no icon. This matches Firefox's behavior for tabs without favicons.
+
+## Cost Estimate
+
+### Scanning Phase (One-Time per Crawl)
+
+| Item | Estimate |
+|------|----------|
+| EC2 c5.xlarge (~24-48hrs) | $8-16 |
+| RDS db.t3.medium (~48hrs) | $3-5 |
+| S3 icons storage (temporary, ~500GB) | $12 (prorated to days) |
+| S3 GET requests (30M WARC reads) | $12 |
+| Data transfer (icon downloads, ~500GB inbound) | $0 (inbound is free) |
+| **Total** | **~$35-45** |
+
+### Hosting Phase (Monthly Steady-State)
+
+| Item | Estimate |
+|------|----------|
+| S3 storage (~60GB bundles) | $1.40 |
+| CloudFront (free tier: 1TB/month, 10M requests) | $0* |
+| S3 requests (via CloudFront origin pulls, cached) | ~$1-5 |
+| **Total** | **~$3-10/month** |
+
+*CloudFront free tier covers moderate traffic. Costs increase if the site goes viral, but that's a good problem to have.
+
+## Scaling Strategy
+
+### Development (100K domains)
+- Single EC2 instance
+- All stages complete in minutes-to-hours
+- Good for validating the full pipeline end-to-end
+
+### Full Scan (30M domains)
+- Single EC2 instance, high concurrency
+- CC-Index query: <1hr
+- WARC parsing: 2-6hrs (limited by S3 request rate)
+- Icon download: 12-48hrs (limited by network + remote server response times)
+- Bundle generation: 1-2hrs
+
+### Fleet Scaling (if needed)
+- Spin up N identical EC2 instances running the icon downloader
+- All share the same RDS instance
+- Work claiming via Postgres atomic UPDATEs (no coordinator needed)
+- Roughly linear scaling while the shared RDS instance keeps up: 4 instances = ~4x throughput
+
+## Key Design Decisions
+
+1. **Static-only hosting** — No servers running for the live site. Entire frontend is pre-built.
+2. **Inline icons in bundles** — No per-icon requests. One bundle fetch gives you ~600 tabs to render.
+3. **Unbound as system resolver** — Transparent to application code. Go HTTP client works normally; DNS just happens to resolve locally.
+4. **Content-addressed icon storage** — S3 key is the content hash. Natural dedup at storage layer during scanning (but icons are duplicated across bundles for simplicity).
+5. **Resumable pipeline** — Each stage uses database state (parsed, scan_state) to track progress. Crash and restart without re-doing completed work.
+6. **PNG as universal icon format** — All icons converted to PNG for bundles regardless of source format. Smallest file size for small raster images, universally supported in browsers via data URIs.
+7. **Date-seeded randomization** — Everyone visiting on the same day sees the same tab arrangement, creating a shared experience. Changes daily for freshness.