added PLAN.md with initial dev plan

2026-05-17 14:00:14 -04:00 · 2026-05-17 14:00:14 -04:00 · c50be97fd7
commit c50be97fd7
parent a327fb3db3
1 changed files with 687 additions and 0 deletions
--- a/PLAN.md
+++ b/PLAN.md
@@ -0,0 +1,687 @@
+# EveryTab Implementation Plan
+
+This plan builds the system described in ARCHITECTURE.md in incremental steps. We start with 100K hosts to validate the pipeline end-to-end, then scale to the full ~30M.
+
+Each step has a clear deliverable and validation criteria. Steps within a phase are sequential; some phases can overlap (noted where applicable).
+
+---
+
+## Phase 0: Project Setup & AWS Infrastructure
+
+### Step 0.1: Repository Structure
+
+Create the project layout:
+
+```
+everytab/
+├── design.md
+├── ARCHITECTURE.md
+├── PLAN.md
+├── infra/               # AWS CLI scripts for setup/teardown
+│   ├── setup.sh         # Create RDS, S3 buckets, security groups
+│   ├── teardown.sh      # Delete non-permanent resources
+│   └── ec2-userdata.sh  # EC2 bootstrap (install Go, DuckDB, Unbound)
+├── pipeline/
+│   ├── 01_cc_index/     # DuckDB query scripts
+│   ├── 02_warc_parse/   # Go program
+│   ├── 03_icon_download/# Go program
+│   ├── 04_best_icon/    # SQL script
+│   ├── 05_bundle_gen/   # Go program
+│   └── 06_frontend/     # Build script, templates
+├── frontend/
+│   ├── index.html
+│   └── site.js
+├── stats/               # Stats output from each stage (gitignored)
+└── go.mod               # Shared Go module for pipeline programs
+```
+
+**Done when:** Repo structure exists, `go.mod` initialized, `.gitignore` covers stats/ and any local config.
+
+### Step 0.2: AWS Infrastructure (Manual CLI)
+
+Create resources using AWS CLI commands in `infra/setup.sh`:
+
+1. **S3 buckets:**
+   - `everytab-icons` (private, no public access)
+   - `everytab-site` (private, accessed via CloudFront OAC)
+
+2. **RDS Postgres:**
+   - `db.t3.medium`, 20GB storage (expandable), Postgres 16
+   - In a VPC, security group allows inbound 5432 from EC2 security group
+   - No public access (EC2 connects within VPC)
+   - No multi-AZ (dev, not production)
+   - Set a strong password, store in a local `.env` (gitignored)
+
+3. **EC2 instance:**
+   - `c5.xlarge` (4 vCPU, 8GB RAM) — enough for Go concurrency + Unbound cache
+   - Amazon Linux 2023 or Ubuntu 24.04
+   - Security group: allow SSH (from your IP), allow outbound all
+   - Same VPC/subnet as RDS
+   - Key pair for SSH access
+
+4. **CloudFront distribution:**
+   - Origin: `everytab-site` S3 bucket (OAC)
+   - Default cache behavior: cache everything, Brotli+Gzip compression
+   - Can set up now or defer to Phase 2
+
+5. **IAM role for EC2:**
+   - S3 read/write to both buckets
+   - Attach as instance profile
+
+**Validation:** SSH into EC2, confirm `psql` can connect to RDS, confirm `aws s3 ls` shows both buckets.
+
+**Done when:** All resources exist, EC2 can reach RDS and S3.
+
+### Step 0.3: EC2 Environment Setup
+
+Bootstrap script (`infra/ec2-userdata.sh` or run manually):
+
+1. Install Go (latest stable, 1.22+)
+2. Install DuckDB CLI
+3. Install Unbound, configure as recursive resolver:
+   - `/etc/unbound/unbound.conf`: recursive mode, no forwarding, listen on 127.0.0.1
+   - High cache: `msg-cache-size: 512m`, `rrset-cache-size: 1g`
+   - `cache-min-ttl: 3600`
+   - `prefetch: yes`
+   - `num-threads: 4`
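+4. Set `/etc/resolv.conf` → `nameserver 127.0.0.1` (where `systemd-resolved` manages this file, as it does by default on Ubuntu 24.04, replace the symlink or point `resolved` at 127.0.0.1, otherwise the change will not persist)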
+5. Install `psql` client, `pg_dump`
+6. Confirm DuckDB httpfs extension works: `INSTALL httpfs; LOAD httpfs;`
+
+**Validation:**
+- `go version` works
+- `duckdb -c "INSTALL httpfs; LOAD httpfs; SELECT 1;"` works
+- `dig example.com @127.0.0.1` resolves (Unbound working)
+- `psql $DATABASE_URL -c "SELECT 1;"` connects to RDS
+
+**Done when:** EC2 is a working development environment for all pipeline stages.
+
+---
+
+## Phase 1: CC-Index Query (Stage 1)
+
+### Step 1.1: Database Schema
+
+Create the Postgres tables. Run via `psql`:
+
+```sql
+CREATE TABLE hosts (
+    id SERIAL PRIMARY KEY,
+    hostname TEXT NOT NULL UNIQUE,
+    protocol TEXT NOT NULL,
+    crawl_id TEXT NOT NULL,
+    warc_filename TEXT NOT NULL,
+    warc_record_offset BIGINT NOT NULL,
+    warc_record_length INT NOT NULL,
+    html_title TEXT,
+    iframe_allowed BOOLEAN,
+    best_icon_s3_key TEXT,
+    parsed BOOLEAN DEFAULT FALSE
+);
+
+CREATE TABLE icons (
+    id SERIAL PRIMARY KEY,
+    host_id INT NOT NULL REFERENCES hosts(id),
+    url TEXT NOT NULL,
+    source TEXT NOT NULL,
+    rel_type TEXT,
+    rel_sizes TEXT,
+    content_type TEXT,
+    width INT,
+    height INT,
+    file_size INT,
+    s3_key TEXT,
+    scan_state TEXT DEFAULT 'unscanned',
+    error TEXT
+);
+
+CREATE INDEX idx_hosts_parsed ON hosts(id) WHERE parsed = FALSE;
+CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned';
+CREATE INDEX idx_icons_host_id ON icons(host_id);
+```
+
+**Done when:** Tables exist in RDS, schema matches ARCHITECTURE.md.
+
+### Step 1.2: DuckDB CC-Index Query (100K limit)
+
+Write `pipeline/01_cc_index/query.sql` (or a shell script wrapping DuckDB CLI).
+
+The script:
+1. Connects DuckDB to RDS via the postgres extension
+2. Queries the CC-Index parquet files via httpfs (latest crawl)
+3. Filters per ARCHITECTURE.md criteria
+4. Deduplicates per hostname (prefer https)
+5. Limits to 100,000 rows for dev
+6. Inserts directly into the hosts table
+
+Key considerations:
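+- Find the latest crawl's columnar index path. The Parquet index lives under `s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2026-05/subset=warc/*.parquet` (the crawl ID here is a placeholder; verify the latest one). The `cc-index/collections/<crawl-id>/indexes/cdx-*.gz` files are the gzipped CDX index, not Parquet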
+- DuckDB postgres extension: `INSTALL postgres; LOAD postgres; ATTACH 'dbname=... host=... ...' AS pg (TYPE POSTGRES);`
+- The dedup logic: partition by hostname, order by protocol (https first), take first row
+- Add `LIMIT 100000` for dev, remove for full run
+- Time the query — if httpfs takes >1hr, switch to downloading parquet first
+
+**Validation:**
+- `SELECT COUNT(*) FROM hosts;` returns ~100,000
+- `SELECT protocol, COUNT(*) FROM hosts GROUP BY protocol;` shows mostly https
+- `SELECT * FROM hosts LIMIT 5;` shows sane data (real hostnames, valid WARC paths)
+- Spot-check: pick a few hostnames, verify they're real websites
+
+**Stats to emit:** `stats/01_cc_index.json` with total_domains, https_count, http_count, query_time_seconds.
+
+**Done when:** 100K hosts in the database with valid WARC coordinates.
+
+### Step 1.3: Validate WARC Coordinates
+
+Quick sanity check — before writing the full WARC parser, confirm we can actually fetch WARC records:
+
+```bash
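+# Pick a random row and capture its WARC coordinates
+read -r WARC_FILENAME OFFSET LENGTH < <(psql "$DATABASE_URL" -At -F ' ' -c \
+  "SELECT warc_filename, warc_record_offset, warc_record_length FROM hosts ORDER BY random() LIMIT 1;")
+
+# Fetch it with a curl byte-range; each record is its own gzip member, so decompress before reading
+curl -s -r "$OFFSET-$((OFFSET + LENGTH - 1))" "https://data.commoncrawl.org/$WARC_FILENAME" | gunzip -c | head -c 500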
+```
+
+Should see a WARC record header followed by HTTP response headers and HTML.
+
+**Done when:** We can manually fetch 3-5 WARC records and see valid HTML content.
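+
+The manual check above can also be scripted. The following is a minimal Go sketch, not the planned `warc.go`: it assumes the fetched byte range holds exactly one gzip member and uses only the standard library. The names `parseWARCRecord` and `sampleRecord` are hypothetical, and the sample record is synthetic so the example runs offline.
+
+```go
+package main
+
+import (
+	"bufio"
+	"bytes"
+	"compress/gzip"
+	"fmt"
+	"io"
+	"net/http"
+	"net/textproto"
+	"strings"
+)
+
+// parseWARCRecord gunzips one byte-range payload and returns the WARC named
+// fields plus the captured HTTP response. The caller should close resp.Body.
+func parseWARCRecord(raw []byte) (textproto.MIMEHeader, *http.Response, error) {
+	zr, err := gzip.NewReader(bytes.NewReader(raw))
+	if err != nil {
+		return nil, nil, fmt.Errorf("gunzip: %w", err)
+	}
+	br := bufio.NewReader(zr)
+
+	// The record starts with a version line such as "WARC/1.0".
+	version, err := br.ReadString('\n')
+	if err != nil || !strings.HasPrefix(version, "WARC/") {
+		return nil, nil, fmt.Errorf("not a WARC record: %q", version)
+	}
+	// WARC named fields are "Name: value" lines ended by a blank line,
+	// so the MIME header reader can parse them.
+	fields, err := textproto.NewReader(br).ReadMIMEHeader()
+	if err != nil {
+		return nil, nil, fmt.Errorf("WARC fields: %w", err)
+	}
+	// The HTTP response begins after the envelope, not at byte 0.
+	resp, err := http.ReadResponse(br, nil)
+	if err != nil {
+		return nil, nil, fmt.Errorf("HTTP response: %w", err)
+	}
+	return fields, resp, nil
+}
+
+// sampleRecord builds a tiny gzip-compressed WARC record for local testing.
+func sampleRecord() []byte {
+	httpPart := "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n" +
+		"X-Frame-Options: DENY\r\nContent-Length: 13\r\n\r\n<html></html>"
+	rec := fmt.Sprintf("WARC/1.0\r\nWARC-Type: response\r\n"+
+		"WARC-Target-URI: https://example.com/\r\nContent-Length: %d\r\n\r\n%s\r\n\r\n",
+		len(httpPart), httpPart)
+	var buf bytes.Buffer
+	zw := gzip.NewWriter(&buf)
+	zw.Write([]byte(rec))
+	zw.Close()
+	return buf.Bytes()
+}
+
+func main() {
+	fields, resp, err := parseWARCRecord(sampleRecord())
+	if err != nil {
+		panic(err)
+	}
+	defer resp.Body.Close()
+	body, _ := io.ReadAll(resp.Body)
+	fmt.Println(fields.Get("WARC-Type"), resp.StatusCode, string(body))
+}
+```
+
+In a real run the input would be the body of the ranged GET shown in the shell snippet. Captured HTTP headers may not always describe the stored payload, so it is probably safer to treat body-length errors as soft failures.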
+
+---
+
+## Phase 2: WARC Parsing (Stage 2)
+
+### Step 2.1: Go Project Setup
+
+Set up the shared Go module and the WARC parser binary:
+
+```
+pipeline/02_warc_parse/
+├── main.go          # Entry point, CLI flags, orchestration
+├── warc.go          # WARC record fetching (S3 byte-range)
+├── parser.go        # HTML parsing (title, link rel=icon, iframe headers)
+└── db.go            # Postgres batch read/write
+```
+
+Dependencies:
+- `github.com/jackc/pgx/v5` — Postgres driver (pool, batch operations)
+- `golang.org/x/net/html` — Lenient HTML parser
+- Standard library `net/http` for S3 byte-range requests
+
+CLI flags:
+- `--db` connection string
+- `--batch-size` (default 500)
+- `--concurrency` (default 1000)
+- `--dry-run` (print parsed results, don't write to DB)
+- `--limit` (process at most N rows, for testing)
+
+**Done when:** Project compiles, connects to DB, can read a batch of hosts rows.
+
+### Step 2.2: WARC Fetch + Parse Logic
+
+Implement:
+1. Byte-range fetch from `https://data.commoncrawl.org/{warc_filename}`
+2. Decompress the fetched bytes (each Common Crawl record is its own gzip member), then parse the WARC record envelope to find the HTTP response within
+3. Extract HTTP response headers:
+   - `X-Frame-Options` → if present and not `ALLOWALL`, iframe_allowed = false
+   - `Content-Security-Policy` → check for `frame-ancestors` directive
+4. Parse HTML body:
+   - Extract `<title>` content (first title tag, truncate at 512 chars)
+   - Extract all `<link rel="icon">` and `<link rel="shortcut icon">`:
+     - href (resolve relative URLs against `{protocol}://{hostname}/`)
+     - type attribute (if present)
+     - sizes attribute (if present)
+   - Ignore data: URIs, ignore links to other domains' icons for now
+
+**Dry-run test:** Run with `--limit 100 --dry-run` and inspect output. Check:
+- Titles look reasonable (not empty, not garbage)
+- Icon URLs are well-formed (absolute, correct protocol)
+- iframe_allowed is set correctly (spot-check against real sites)
+
+**Done when:** Can parse 100 WARC records correctly with `--dry-run` showing reasonable output.
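+
+The header and URL rules in this step can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the planned `parser.go`: the function names are hypothetical, and the `frame-ancestors` handling assumes that only a bare `*` source leaves third-party framing open, which the plan does not specify.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/http"
+	"net/url"
+	"strings"
+)
+
+// iframeAllowed applies the two response-header rules from Step 2.2.
+func iframeAllowed(h http.Header) bool {
+	xfo := strings.TrimSpace(h.Get("X-Frame-Options"))
+	if xfo != "" && !strings.EqualFold(xfo, "ALLOWALL") {
+		return false
+	}
+	for _, csp := range h.Values("Content-Security-Policy") {
+		for _, directive := range strings.Split(csp, ";") {
+			parts := strings.Fields(strings.ToLower(directive))
+			if len(parts) == 0 || parts[0] != "frame-ancestors" {
+				continue
+			}
+			// Assumption: only a bare "*" source leaves third-party framing open.
+			open := false
+			for _, src := range parts[1:] {
+				if src == "*" {
+					open = true
+				}
+			}
+			if !open {
+				return false
+			}
+		}
+	}
+	return true
+}
+
+// resolveIconURL resolves href against {protocol}://{hostname}/ and reports
+// false for data: URIs, unparsable hrefs, and icons hosted on other domains.
+func resolveIconURL(protocol, hostname, href string) (string, bool) {
+	href = strings.TrimSpace(href)
+	if href == "" || strings.HasPrefix(strings.ToLower(href), "data:") {
+		return "", false
+	}
+	ref, err := url.Parse(href)
+	if err != nil {
+		return "", false
+	}
+	base := &url.URL{Scheme: protocol, Host: hostname, Path: "/"}
+	abs := base.ResolveReference(ref)
+	if (abs.Scheme != "http" && abs.Scheme != "https") || !strings.EqualFold(abs.Hostname(), hostname) {
+		return "", false
+	}
+	return abs.String(), true
+}
+
+// headers builds an http.Header from key/value pairs (helper for examples).
+func headers(kv ...string) http.Header {
+	h := http.Header{}
+	for i := 0; i+1 < len(kv); i += 2 {
+		h.Add(kv[i], kv[i+1])
+	}
+	return h
+}
+
+func main() {
+	fmt.Println(iframeAllowed(headers("Content-Security-Policy", "frame-ancestors 'self'")))
+	fmt.Println(resolveIconURL("https", "example.com", "../icons/fav.png"))
+}
+```
+
+Resolving through `net/url` rather than string concatenation means hrefs such as `../icons/fav.png` are clamped at the root instead of producing a malformed path.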
+
+### Step 2.3: Batch DB Writes + Full 100K Run
+
+Implement the database write path:
+1. For each parsed host: UPDATE hosts SET html_title, iframe_allowed, parsed = TRUE
+2. For each host: INSERT INTO icons (host_id, url, source='favicon_ico') for `/favicon.ico`
+3. For each discovered link rel=icon: INSERT INTO icons (host_id, url, source='link_rel', rel_type, rel_sizes)
+4. Use batch/bulk operations (pgx CopyFrom or batch INSERT)
+
+Run against the full 100K hosts:
+- Monitor throughput (hosts/sec)
+- Watch for errors (log to stderr)
+
+**Validation:**
+- `SELECT COUNT(*) FROM hosts WHERE parsed = TRUE;` should approach 100,000
+- `SELECT COUNT(*) FROM icons;` should be > 100,000 (at minimum one /favicon.ico per host)
+- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL;` — expect 90%+
+- `SELECT source, COUNT(*) FROM icons GROUP BY source;` — see the split
+- `SELECT COUNT(*) FROM hosts WHERE iframe_allowed = FALSE;` — expect 30-50%
+- Spot-check: pick some hosts, verify title matches the actual site
+
+**Stats:** `stats/02_warc_parse.json`
+
+**Done when:** All 100K hosts parsed, icons table populated, stats look reasonable.
+
+---
+
+## Phase 3: Icon Download (Stage 3)
+
+### Step 3.1: Icon Downloader Go Program
+
+```
+pipeline/03_icon_download/
+├── main.go          # Entry point, CLI flags, worker pool
+├── downloader.go    # HTTP fetch with timeouts, size limits
+├── decoder.go       # Image validation + dimension extraction
+├── s3.go            # Upload to everytab-icons bucket
+└── db.go            # Claim work, update results
+```
+
+CLI flags:
+- `--db` connection string
+- `--s3-bucket` (default `everytab-icons`)
+- `--concurrency` (default 1000, tunable)
+- `--batch-size` (default 500)
+- `--timeout` (default 10s)
+- `--max-size` (default 512KB)
+- `--dry-run` (fetch and validate but don't upload to S3 or update DB)
+- `--limit` (process at most N icons)
+
+Dependencies:
+- `github.com/jackc/pgx/v5` — Postgres
+- `github.com/aws/aws-sdk-go-v2` — S3 uploads
+- Standard library `image` + sub-packages (`image/png`, `image/gif`, `image/jpeg`) for decoding dimensions; WebP and BMP decoders live outside the standard library in `golang.org/x/image/webp` and `golang.org/x/image/bmp`
+- A library for ICO parsing: find a maintained ICO decoder, or write a simple one that reads the ICO header for directory entries
+
+### Step 3.2: Work Claiming + Download Logic
+
+Implement:
+1. Claim batch with randomized order (md5 shuffle, FOR UPDATE SKIP LOCKED)
+2. For each icon URL:
+   - HTTP GET with timeouts (5s dial, 10s total)
+   - Read up to max-size bytes, abort if exceeded
+   - Validate magic bytes (PNG: `\x89PNG`, GIF: `GIF8`, ICO: `\x00\x00\x01\x00`, etc.)
+   - Determine actual content type from magic bytes (don't trust HTTP Content-Type)
+   - Decode dimensions:
+     - PNG/GIF/JPEG/WebP/BMP: read image header (Go `image.DecodeConfig`)
+     - ICO: parse directory entries, find largest at standard size ≤64x64
+     - SVG: set width=NULL, height=NULL
+   - Compute SHA-256 of full content
+   - Check if S3 key exists (HEAD request); if yes, skip upload (dedup)
+   - Upload to S3 if new
+3. Update icons row with results (or error)
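+
+The magic-byte check and content hash from the list above might look roughly like this. It is a sketch, assuming the S3 key is simply the hex SHA-256 of the content (the plan only says the key is a content hash); `sniffImageType` and `contentKey` are hypothetical names.
+
+```go
+package main
+
+import (
+	"bytes"
+	"crypto/sha256"
+	"encoding/hex"
+	"fmt"
+)
+
+// sniffImageType returns a MIME type based on leading bytes, or "" if unknown.
+func sniffImageType(b []byte) string {
+	switch {
+	case bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")):
+		return "image/png"
+	case bytes.HasPrefix(b, []byte("GIF87a")), bytes.HasPrefix(b, []byte("GIF89a")):
+		return "image/gif"
+	case bytes.HasPrefix(b, []byte{0xFF, 0xD8, 0xFF}):
+		return "image/jpeg"
+	case bytes.HasPrefix(b, []byte{0x00, 0x00, 0x01, 0x00}):
+		return "image/x-icon"
+	case len(b) >= 12 && string(b[0:4]) == "RIFF" && string(b[8:12]) == "WEBP":
+		return "image/webp"
+	case bytes.HasPrefix(b, []byte("BM")):
+		return "image/bmp"
+	}
+	// SVG is text, so look for an <svg tag near the start instead.
+	head := b
+	if len(head) > 512 {
+		head = head[:512]
+	}
+	if bytes.Contains(bytes.ToLower(head), []byte("<svg")) {
+		return "image/svg+xml"
+	}
+	return ""
+}
+
+// contentKey derives a dedup key from the full content.
+// Assumption: the key is the bare hex digest with no prefix or extension.
+func contentKey(b []byte) string {
+	sum := sha256.Sum256(b)
+	return hex.EncodeToString(sum[:])
+}
+
+func main() {
+	data := []byte("GIF89a\x01\x00\x01\x00")
+	fmt.Println(sniffImageType(data), contentKey(data))
+}
+```
+
+An empty return from `sniffImageType` would map to a failed row (for example an HTML error page served at `/favicon.ico`), which keeps non-images out of S3.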
+
+**Dry-run test:** `--limit 200 --dry-run` — prints what it would do for 200 icons. Check URLs, detected types, dimensions.
+
+**Done when:** Can download, validate, and upload icons for a small batch.
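+
+If no suitable ICO library turns up, reading the directory by hand is fairly small. The sketch below parses only the ICO header and directory entries (not the embedded images) and picks the largest square entry at a standard size. The standard sizes 16, 32, 48 and 64 are taken from the selection SQL in Step 4.1; all identifiers are hypothetical.
+
+```go
+package main
+
+import (
+	"encoding/binary"
+	"errors"
+	"fmt"
+)
+
+// icoEntry is one 16-byte ICONDIRENTRY, reduced to the fields used here.
+type icoEntry struct {
+	Width, Height int
+	Size, Offset  uint32
+}
+
+// parseICODir reads the 6-byte header and the directory that follows it.
+func parseICODir(b []byte) ([]icoEntry, error) {
+	le := binary.LittleEndian
+	if len(b) < 6 || le.Uint16(b[0:2]) != 0 || le.Uint16(b[2:4]) != 1 {
+		return nil, errors.New("not an ICO file")
+	}
+	n := int(le.Uint16(b[4:6]))
+	if len(b) < 6+16*n {
+		return nil, errors.New("truncated ICO directory")
+	}
+	entries := make([]icoEntry, 0, n)
+	for i := 0; i < n; i++ {
+		e := b[6+16*i : 6+16*(i+1)]
+		w, h := int(e[0]), int(e[1])
+		// A stored dimension of 0 means 256 pixels.
+		if w == 0 {
+			w = 256
+		}
+		if h == 0 {
+			h = 256
+		}
+		entries = append(entries, icoEntry{w, h, le.Uint32(e[8:12]), le.Uint32(e[12:16])})
+	}
+	return entries, nil
+}
+
+// pickICOEntry returns the largest square entry at a standard size <= 64.
+func pickICOEntry(entries []icoEntry) (icoEntry, bool) {
+	standard := map[int]bool{16: true, 32: true, 48: true, 64: true}
+	var best icoEntry
+	found := false
+	for _, e := range entries {
+		if e.Width != e.Height || !standard[e.Width] {
+			continue
+		}
+		if !found || e.Width > best.Width {
+			best, found = e, true
+		}
+	}
+	return best, found
+}
+
+// buildICODir fabricates a directory-only ICO for local testing.
+func buildICODir(sizes ...int) []byte {
+	b := []byte{0, 0, 1, 0, byte(len(sizes)), 0}
+	for _, s := range sizes {
+		entry := make([]byte, 16)
+		entry[0], entry[1] = byte(s), byte(s) // 256 wraps to 0, as in the format
+		b = append(b, entry...)
+	}
+	return b
+}
+
+func main() {
+	entries, err := parseICODir(buildICODir(16, 48, 256))
+	if err != nil {
+		panic(err)
+	}
+	best, ok := pickICOEntry(entries)
+	fmt.Println(len(entries), best.Width, best.Height, ok)
+}
+```
+
+The chosen entry's `Offset` and `Size` locate the embedded BMP or PNG, which the conversion step in Phase 4 would decode.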
+
+### Step 3.3: Full 100K Icon Run
+
+Run against all icons in the database (likely 150K-300K icon rows for 100K hosts).
+
+Monitor:
+- icons/sec throughput
+- Error breakdown (DNS failures, timeouts, HTTP errors, invalid images)
+- S3 dedup hit rate
+- Memory usage (adjust concurrency if needed)
+
+**Validation:**
+- `SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;` — expect mostly completed, some failed
+- `SELECT error, COUNT(*) FROM icons WHERE scan_state = 'failed' GROUP BY error ORDER BY count DESC LIMIT 20;` — understand failure modes
+- `aws s3 ls s3://everytab-icons/ | wc -l` — confirm icons in S3
+- Spot-check: download a few icons from S3, open them, verify they're valid images
+
+**Stats:** `stats/03_icon_download.json`
+
+**Done when:** Icon download complete for 100K dev set, error rate understood, S3 populated.
+
+---
+
+## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5)
+
+### Step 4.1: Best Icon Selection SQL
+
+Write `pipeline/04_best_icon/select.sql`:
+
+```sql
+UPDATE hosts h SET best_icon_s3_key = sub.s3_key
+FROM (
+  SELECT DISTINCT ON (i.host_id) i.host_id, i.s3_key
+  FROM icons i
+  WHERE i.scan_state = 'completed'
+  ORDER BY i.host_id,
+    CASE
+      WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
+      WHEN i.width = i.height AND i.width <= 64 THEN 1
+      WHEN i.width IS NOT NULL AND i.width <= 64 AND i.height <= 64 THEN 2
+      ELSE 3
+    END,
+    COALESCE(i.width, 0) DESC,
+    CASE
+      WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
+      WHEN i.content_type = 'image/webp' THEN 1
+      WHEN i.content_type = 'image/svg+xml' THEN 2
+      ELSE 3
+    END,
+    i.file_size ASC
+) sub
+WHERE h.id = sub.host_id;
+```
+
+**Validation:**
+- `SELECT COUNT(*) FROM hosts WHERE best_icon_s3_key IS NOT NULL;` — expect 60-80% of hosts
+- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL AND best_icon_s3_key IS NULL;` — hosts with title but no icon (will still be in bundles)
+- Spot-check: for a few hosts, verify the selected icon is reasonable (correct size, valid)
+
+**Stats:** `stats/04_best_icon.json`
+
+**Done when:** best_icon_s3_key populated for hosts that have valid icons.
+
+### Step 4.2: Bundle Generator Go Program
+
+```
+pipeline/05_bundle_gen/
+├── main.go          # Entry point, CLI flags
+├── db.go            # Query hosts + icon keys
+├── convert.go       # Icon format conversion → PNG
+├── bundle.go        # Chunk + serialize JSON
+└── s3.go            # Upload bundles to everytab-site
+```
+
+CLI flags:
+- `--db` connection string
+- `--icons-bucket` (default `everytab-icons`)
+- `--site-bucket` (default `everytab-site`)
+- `--entries-per-bundle` (tunable, start at 120)
+- `--dry-run` (generate bundles to local disk, don't upload)
+- `--limit` (only process N hosts, for testing)
+
+### Step 4.3: Icon Conversion Logic
+
+Implement format conversion to PNG:
+1. Download icon from S3 by key
+2. Detect format from magic bytes
+3. Decode:
+   - PNG: decode directly
+   - ICO: parse container, extract image at recorded width/height, decode BMP or PNG within
+   - GIF/JPEG/BMP/WebP: decode to RGBA
+   - SVG: rasterize to 32x32 (use a Go SVG library, or shell out to `rsvg-convert` if simpler)
+4. Re-encode as PNG (optimized, don't upscale)
+5. Base64-encode
+
+**Test:** Convert 50 icons of mixed formats manually, verify output PNGs look correct.
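+
+For the formats the standard library decodes (PNG, GIF, JPEG), the decode, re-encode, and base64 steps could be sketched as follows. ICO, WebP, BMP, and SVG need the extra handling described above and are left out. `toPNGBase64`, `sampleGIF`, and `pngSize` are hypothetical names, and the sample input is generated in memory.
+
+```go
+package main
+
+import (
+	"bytes"
+	"encoding/base64"
+	"fmt"
+	"image"
+	"image/color"
+	"image/gif"
+	_ "image/jpeg" // registers the JPEG decoder with image.Decode
+	"image/png"
+)
+
+// toPNGBase64 decodes a PNG, GIF, or JPEG and returns it re-encoded as
+// base64 PNG together with its pixel dimensions. It never resizes.
+func toPNGBase64(data []byte) (string, int, int, error) {
+	img, _, err := image.Decode(bytes.NewReader(data))
+	if err != nil {
+		return "", 0, 0, fmt.Errorf("decode: %w", err)
+	}
+	var buf bytes.Buffer
+	enc := png.Encoder{CompressionLevel: png.BestCompression}
+	if err := enc.Encode(&buf, img); err != nil {
+		return "", 0, 0, fmt.Errorf("encode: %w", err)
+	}
+	b := img.Bounds()
+	return base64.StdEncoding.EncodeToString(buf.Bytes()), b.Dx(), b.Dy(), nil
+}
+
+// sampleGIF builds a small GIF in memory for local testing.
+func sampleGIF(w, h int) []byte {
+	img := image.NewRGBA(image.Rect(0, 0, w, h))
+	img.Set(0, 0, color.RGBA{R: 255, A: 255})
+	var buf bytes.Buffer
+	if err := gif.Encode(&buf, img, nil); err != nil {
+		panic(err)
+	}
+	return buf.Bytes()
+}
+
+// pngSize decodes base64 PNG data and reports its dimensions.
+func pngSize(b64 string) (int, int, error) {
+	raw, err := base64.StdEncoding.DecodeString(b64)
+	if err != nil {
+		return 0, 0, err
+	}
+	cfg, err := png.DecodeConfig(bytes.NewReader(raw))
+	return cfg.Width, cfg.Height, err
+}
+
+func main() {
+	b64, w, h, err := toPNGBase64(sampleGIF(3, 2))
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(w, h, len(b64) > 0)
+}
+```
+
+For animated GIFs, `image.Decode` returns only the first frame, which is likely what a static tab icon wants anyway.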
+
+### Step 4.4: Bundle Assembly + Upload
+
+Implement:
+1. Query all hosts WHERE html_title IS NOT NULL, randomize (ORDER BY random())
+2. For each host: fetch + convert its icon (or set empty string if no icon)
+3. Assemble entries into chunks of `ENTRIES_PER_BUNDLE`
+4. Serialize each chunk as JSON (`tabs/{n}.json`)
+5. Upload to S3 `everytab-site/tabs/`
+6. Record total bundle count
+
+**Dry-run:** Generate bundles to local disk, inspect a few:
+- Valid JSON
+- Icons render in browser (paste a data:image/png;base64,... URI)
+- Entries have host, title, icon, icon_w, icon_h, iframe_ok
+
+**Validation:**
+- Bundle files exist in S3
+- `aws s3 ls s3://everytab-site/tabs/ | wc -l` matches expected count
+- Random bundle can be fetched and parsed as JSON
+- Total hosts across all bundles = count of hosts with titles
+
+**Stats:** `stats/05_bundle_gen.json`
+
+**Done when:** All bundles uploaded to S3, JSON is valid, icons render.
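+
+As a rough sketch of the chunk-and-serialize part of this step: the entry fields below mirror the list in the dry-run check, while the top-level JSON shape (a bare array of entries) and the identifier names are assumptions, since the bundle format itself is defined in ARCHITECTURE.md.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// Entry is one tab in a bundle; field names follow the dry-run checklist.
+type Entry struct {
+	Host     string `json:"host"`
+	Title    string `json:"title"`
+	Icon     string `json:"icon"` // base64 PNG, or "" when the host has no icon
+	IconW    int    `json:"icon_w"`
+	IconH    int    `json:"icon_h"`
+	IframeOK bool   `json:"iframe_ok"`
+}
+
+// chunkEntries splits entries into consecutive bundles of at most perBundle.
+func chunkEntries(entries []Entry, perBundle int) [][]Entry {
+	if perBundle <= 0 {
+		return nil
+	}
+	var out [][]Entry
+	for start := 0; start < len(entries); start += perBundle {
+		end := start + perBundle
+		if end > len(entries) {
+			end = len(entries)
+		}
+		out = append(out, entries[start:end])
+	}
+	return out
+}
+
+// bundleName returns the object key for bundle n.
+func bundleName(n int) string {
+	return fmt.Sprintf("tabs/%d.json", n)
+}
+
+// marshalBundle serializes one chunk. Assumption: a bare JSON array.
+func marshalBundle(chunk []Entry) ([]byte, error) {
+	return json.Marshal(chunk)
+}
+
+// sampleEntries fabricates n placeholder entries for local testing.
+func sampleEntries(n int) []Entry {
+	entries := make([]Entry, n)
+	for i := range entries {
+		entries[i] = Entry{Host: fmt.Sprintf("host%d.example", i), Title: "Example", IframeOK: true}
+	}
+	return entries
+}
+
+func main() {
+	chunks := chunkEntries(sampleEntries(250), 120)
+	for n, chunk := range chunks {
+		data, err := marshalBundle(chunk)
+		if err != nil {
+			panic(err)
+		}
+		fmt.Println(bundleName(n), len(chunk), len(data))
+	}
+}
+```
+
+The number of chunks is the value the frontend build later bakes in as `TOTAL_BUNDLES`.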
+
+---
+
+## Phase 5: Frontend (Stage 6)
+
+This phase can begin in parallel with Phases 3-4 using mock bundle data.
+
+### Step 5.1: Mock Data for Frontend Dev
+
+Generate 2-3 small mock bundle files (`tabs/0.json`, `tabs/1.json`, `tabs/2.json`) with ~20 entries each. Use real favicons (Google, GitHub, Wikipedia, etc.) manually base64-encoded. This lets us develop the frontend without waiting for the pipeline.
+
+Serve locally with any static file server (`python3 -m http.server`).
+
+**Done when:** Mock bundles exist and can be served locally.
+
+### Step 5.2: Basic Tab Rendering
+
+Build `frontend/index.html` and `frontend/site.js`:
+
+1. HTML: minimal shell with a container div, inline CSS for tab styling
+2. JS: fetch a bundle, render tabs as rows filling the viewport
+3. Tab appearance: mimic Firefox tab shape (rounded top corners, slight border)
+4. Each tab shows favicon (16x16 or 32x32 img from data URI) + truncated title
+5. No-icon tabs show title only
+
+Focus: get the visual density right. How many tabs fit across? How many rows fill the viewport? This determines `ENTRIES_PER_BUNDLE`.
+
+**Done when:** Page renders tabs from a mock bundle. Visually looks like a page full of browser tabs.
+
+### Step 5.3: Marquee Animation
+
+Add horizontal marquee to each row:
+- CSS `@keyframes` animation, translateX
+- Each row at slightly different speed and direction (some left, some right)
+- Smooth, subtle movement — not distracting, just enough to feel alive
+- Rows need extra tabs beyond viewport width to avoid gaps during scroll
+
+**Done when:** Rows scroll smoothly, no visual glitches at edges.
+
+### Step 5.4: Interaction — Click, Iframe, Close
+
+Implement tab click behavior:
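+1. If `iframe_ok`: show an overlay with an iframe loading the site (`{protocol}://{hostname}`; the entry fields listed in Step 4.4 carry no protocol, so either add one to the bundle entry or default to `https`)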
+2. If `!iframe_ok`: open in new tab (`target="_blank"`, add rel="noopener")
+3. Visual indicator on tabs that will open externally (small icon/badge)
+4. Close overlay: X button + click-outside + Escape key
+
+**Done when:** Clicking tabs works correctly for both iframe and external cases.
+
+### Step 5.5: Infinite Scroll + Random Bundle Loading
+
+Implement:
+1. PRNG seeded with `Date.now()`: each page load gets a different order, and a given seed yields a deterministic sequence of bundle indices
+2. On page load: fetch first bundle, render
+3. Scroll detection: when user approaches bottom, fetch next random bundle
+4. Track loaded bundle IDs in a Set (no duplicates)
+5. Append new rows below existing ones
+6. Handle edge case: all bundles loaded (unlikely with 50K+ bundles but handle gracefully)
+
+`TOTAL_BUNDLES` is a constant baked into the JS at build time.
+
+**Done when:** Infinite scroll works, new bundles load seamlessly, no duplicate bundles.
+
+### Step 5.6: Frontend Build Script
+
+Write `pipeline/06_frontend/build.sh`:
+1. Read total bundle count (from pipeline output or S3)
+2. Inject `const TOTAL_BUNDLES = {M};` into site.js
+3. Copy index.html + site.js to S3 `everytab-site/`
+4. Invalidate CloudFront (if distribution exists)
+
+**Done when:** Build script produces deployable frontend with correct bundle count.
+
+---
+
+## Phase 6: Integration & End-to-End Test (100K)
+
+### Step 6.1: Run Full Pipeline (100K)
+
+Execute all stages in sequence on EC2:
+1. Verify hosts table has 100K entries (from Phase 1)
+2. Run WARC parser (Phase 2) — should complete in minutes
+3. Run icon downloader (Phase 3) — should complete in 10-30 minutes at 100K scale
+4. Run best icon selection (Phase 4.1)
+5. Run bundle generator (Phase 4.2-4.4)
+6. Run frontend build (Phase 5.6)
+
+**Validation:** Visit the CloudFront URL. The site should work:
+- Tabs render with real favicons and titles
+- Clicking works (iframe + external)
+- Scrolling loads more tabs
+- No JS console errors
+
+### Step 6.2: Tune Parameters
+
+Based on the 100K run:
+- **ENTRIES_PER_BUNDLE:** Look at the live site. Does one bundle fill the screen? Too many tabs? Too few? Adjust.
+- **Concurrency:** Was the icon download memory-stable? CPU-bound or network-bound? Adjust goroutine pool size.
+- **Timeouts:** What was the error distribution? Are timeouts too aggressive? Too lenient?
+- **Icon selection:** Do the selected icons look good? Any weird sizes or broken images?
+
+Update CLI flag defaults based on findings.
+
+### Step 6.3: Collect & Review Stats
+
+Merge all `stats/*.json` into a single pipeline report. Review:
+- Loss at each stage (domains → parsed → icons downloaded → icons selected → bundled)
+- Time per stage
+- Error patterns (are certain TLDs failing more? certain icon formats?)
+- Storage usage (S3 icons bucket, S3 site bucket)
+
+Identify any pipeline bugs or data quality issues. Fix before scaling up.
+
+**Done when:** End-to-end works at 100K, parameters tuned, stats reviewed, bugs fixed.
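+
+The "loss at each stage" review is simple arithmetic over per-stage counts. A minimal sketch, assuming each stats file can be reduced to a single count per stage (the stats schema is not fixed by this plan, and the numbers in `main` are made up for illustration):
+
+```go
+package main
+
+import "fmt"
+
+// StageCount is the number of items that survived a pipeline stage.
+type StageCount struct {
+	Stage string
+	Count int
+}
+
+// Drop describes the loss between two consecutive stages.
+type Drop struct {
+	From, To string
+	Lost     int
+	Pct      float64 // percentage of the previous stage's count
+}
+
+// stageDrops computes the loss between each pair of consecutive stages.
+func stageDrops(stages []StageCount) []Drop {
+	var drops []Drop
+	for i := 1; i < len(stages); i++ {
+		prev, cur := stages[i-1], stages[i]
+		lost := prev.Count - cur.Count
+		pct := 0.0
+		if prev.Count > 0 {
+			pct = float64(lost) * 100 / float64(prev.Count)
+		}
+		drops = append(drops, Drop{prev.Stage, cur.Stage, lost, pct})
+	}
+	return drops
+}
+
+func main() {
+	// Illustrative numbers only, not measured results.
+	stages := []StageCount{{"domains", 100000}, {"parsed", 95000}, {"bundled", 90000}}
+	for _, d := range stageDrops(stages) {
+		fmt.Printf("%s -> %s: lost %d (%.1f%%)\n", d.From, d.To, d.Lost, d.Pct)
+	}
+}
+```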
+
+---
+
+## Phase 7: Full-Scale Run (30M)
+
+### Step 7.1: Remove Limits, Re-run CC-Index Query
+
+Update the DuckDB query to remove `LIMIT 100000`. Re-run.
+
+Considerations:
+- If httpfs takes >1hr, switch to downloading the parquet files first
+- May need to increase RDS storage (30M rows with WARC paths ≈ 5-10GB)
+- Monitor DuckDB memory usage
+
+**Validation:** `SELECT COUNT(*) FROM hosts;` shows ~30M rows.
+
+### Step 7.2: Run WARC Parser at Scale
+
+Run with full concurrency against 30M hosts. Expected time: 2-6 hours.
+
+Monitor:
+- Throughput (hosts/sec)
+- Error rate stability (should plateau, not climb)
+- Postgres connection pool health
+- Memory usage
+
+### Step 7.3: Run Icon Downloader at Scale
+
+This is the long pole — expected 12-48 hours.
+
+Monitor continuously:
+- icons/sec rate
+- DNS cache hit rate (check Unbound stats: `unbound-control stats`)
+- S3 upload rate
+- Error rate by type
+- Completion percentage
+
+If too slow (projected >48hrs):
+- Consider increasing concurrency (if memory allows)
+- Consider spinning up a fleet (add more EC2 instances running the same binary)
+- Check if DNS is the bottleneck (Unbound stats)
+- Check if S3 uploads are the bottleneck (batch or reduce HEAD checks)
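+
+The "projected >48hrs" check is a back-of-the-envelope calculation from the observed rate. A small sketch, assuming throughput scales linearly with identical instances (which DNS or Postgres bottlenecks may break); the function names are hypothetical:
+
+```go
+package main
+
+import (
+	"fmt"
+	"math"
+)
+
+// projectedHours estimates remaining wall-clock time at the observed rate.
+func projectedHours(remaining int64, perSec float64) float64 {
+	if perSec <= 0 {
+		return math.Inf(1)
+	}
+	return float64(remaining) / perSec / 3600
+}
+
+// instancesNeeded estimates how many identical instances would finish within
+// targetHours. Assumption: throughput scales linearly with instance count.
+func instancesNeeded(remaining int64, perSecPerInstance, targetHours float64) int {
+	if perSecPerInstance <= 0 || targetHours <= 0 {
+		return 0
+	}
+	return int(math.Ceil(projectedHours(remaining, perSecPerInstance) / targetHours))
+}
+
+func main() {
+	// Illustrative inputs only: 36M icons remaining at 100 icons/sec.
+	fmt.Printf("%.1f hours, %d instance(s) for a 48h target\n",
+		projectedHours(36_000_000, 100), instancesNeeded(36_000_000, 100, 48))
+}
+```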
+
+### Step 7.4: Best Icon Selection + Bundle Generation
+
+Run at full scale. Expected: 1-2 hours total.
+
+Monitor bundle sizes — verify they're in the expected range with `ENTRIES_PER_BUNDLE` from tuning.
+
+### Step 7.5: Rebuild Frontend + Deploy
+
+Run frontend build with the real bundle count. Invalidate CloudFront.
+
+**Validation:** Visit the live site. Browse around. Check:
+- Tab variety (seeing diverse sites, not just one TLD)
+- Icon quality (no broken images, reasonable sizes)
+- Performance (bundles load quickly, no jank)
+- Stats page / stats.json looks correct
+
+**Done when:** Full-scale site is live and working.
+
+---
+
+## Phase 8: Backup & Teardown
+
+### Step 8.1: Backup RDS to Homelab
+
+```bash
+# On EC2 (fast connection to RDS):
+pg_dump -Fc -d "$DATABASE_URL" -f everytab_dump.pgfc
+
+# Transfer to homelab (from EC2 or direct):
+scp everytab_dump.pgfc homelab:/backups/everytab/
+
+# On homelab, verify restore (the target database must exist; --no-owner skips RDS-only roles):
+createdb everytab_local
+pg_restore --no-owner -d everytab_local everytab_dump.pgfc
+psql everytab_local -c "SELECT COUNT(*) FROM hosts;" -c "SELECT COUNT(*) FROM icons;"
+```
+
+### Step 8.2: Backup Icons S3 to Homelab
+
+```bash
+# From homelab (or EC2 as intermediary):
+aws s3 sync s3://everytab-icons/ /backups/everytab/icons/
+
+# Verify file count matches:
+ls /backups/everytab/icons/ | wc -l
+# Compare with: aws s3 ls s3://everytab-icons/ | wc -l
+```
+
+### Step 8.3: Verify & Teardown
+
+After confirming backups:
+
+```bash
+# Verify the live site still works (it only depends on everytab-site + CloudFront)
+curl -s https://your-cloudfront-domain.net/ | head
+
+# Teardown scanning infrastructure:
+aws rds delete-db-instance --db-instance-identifier everytab --skip-final-snapshot
+aws s3 rb s3://everytab-icons --force
+aws ec2 terminate-instances --instance-ids i-xxxxx
+```
+
+**Done when:** Only `everytab-site` S3 bucket + CloudFront remain running. Monthly cost: ~$2-4.
+
+---
+
+## Development Notes
+
+### What Can Be Parallelized
+
+- **Frontend dev (Phase 5.1-5.5)** can happen at any time using mock data
+- **AWS infra setup (Phase 0.2)** can happen while writing code locally
+- **Icon downloader (Phase 3)** and **bundle generator (Phase 4)** are independent codebases, can be written in parallel
+
+### Testing Strategy
+
+- **Dry-run flags** on all Go programs: print what would happen without mutating DB/S3
+- **--limit flags** on all Go programs: process a small subset quickly
+- **Spot-checks:** after each stage, manually verify 5-10 random entries
+- **Stats files:** compare counts between stages to catch data loss
+- **100K dev set:** full pipeline at small scale before committing to a 24hr+ full run
+
+### Common Pitfalls to Watch For
+
+- **DuckDB CC-Index path:** The exact S3 path to parquet files changes per crawl. Check Common Crawl's website for the latest crawl ID and index location.
+- **WARC record format:** Each fetched byte range is an individually gzipped WARC record with a specific envelope format (WARC/1.0 header, blank line, HTTP response). Decompress first, and don't assume the HTTP response starts at byte 0.
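+- **Relative icon URLs:** `/favicon.ico` is relative to root, but `favicon.ico` (no leading slash) is relative to the page path. Since we only have root pages (`/`), both resolve the same. `../icons/fav.png` climbs above the root; standard URL resolution (RFC 3986, as implemented by Go's `net/url` `ResolveReference`) clamps it to `/icons/fav.png`, so resolve with the library rather than by string concatenation, and skip anything that fails to parse.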
+- **ICO files are complex:** The ICO container format can embed BMP (with a modified header) or PNG. Many "ICO" files are actually just PNGs renamed to .ico. Check magic bytes, not file extension.
+- **SVG rasterization:** Go doesn't have great native SVG support. Consider shelling out to `rsvg-convert` (the CLI that ships with librsvg), or use a Go library such as `github.com/srwiley/oksvg` with `github.com/srwiley/rasterx`. This can be a follow-up if SVG icons are rare.
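+- **Postgres connection limits:** RDS Postgres derives its default `max_connections` from instance memory (`LEAST({DBInstanceClassMemory/9531392}, 5000)`), which is on the order of 400 for a 4 GiB db.t3.medium; confirm with `SHOW max_connections;`. With 1000 goroutines we still need connection pooling (pgx pool handles this). Set pool max to ~40 connections.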
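+- **S3 consistency:** S3 has provided strong read-after-write consistency since December 2020, so a HEAD issued after a completed upload will find the object. The remaining race is two workers both seeing "not found" for the same hash and both uploading; that is harmless because the key is the content hash (idempotent).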
+- **CloudFront caching:** After deploying new bundles, invalidate `/*` or set short TTL during development. For production, use long TTLs (bundles are immutable between crawls).