updated PLAN.md and ARCHITECTURE.md with new instance type and performance concerns

2026-05-20 13:17:03 -04:00 · 2026-05-20 13:17:03 -04:00 · baf657a8ed
commit baf657a8ed
parent b419b5bf6c
2 changed files with 98 additions and 25 deletions
--- a/PLAN.md
+++ b/PLAN.md
@@ -606,25 +606,43 @@ Adversarial code review followed by performance improvements, validated with a 3
 → 2,358 bundles, 603MB total
 ```

-## Phase 8: 30M Full Run (Single Machine)
+## Phase 8: 30M Full Run [IN PROGRESS]

-Full internet scan on one c5.xlarge.
+Full internet scan. Upgraded to c5.2xlarge + db.m5.large after 300K run revealed bottlenecks.

+### Infrastructure changes (mid-run)
+- **EC2: c5.xlarge → c5.2xlarge** (8 vCPUs, 16GB RAM). WARC parsing was CPU-bound at 100% on 4 cores. Icon download was memory-limited at 2500 concurrent connections. Bundle gen was CPU-bound. The extra cores and RAM benefit all three stages.
+- **RDS: db.t3.medium → db.m5.large** (non-burstable, 2 vCPUs, 8GB RAM). The t3 burstable instance was getting CPU-throttled under sustained write load, causing the WARC parser's DB writer to stall and back-pressure workers.
+- **Swap: 4GB → 8GB** (doubled as a safety margin; half of the 16GB physical RAM).
+- **DuckDB temp_directory** set to EBS (`~/duckdb_temp`) instead of defaulting to tmpfs. DuckDB's managed spill-to-disk is far more efficient than OS swap — sequential large reads vs random 4KB page faults.
+
+### Tuning changes
+- **WARC parser concurrency: 500 → 500** (unchanged, but now spread over 8 cores instead of 4, so actual throughput is higher)
+- **WARC parser write batch: 100 → 500** (~2000 queries per DB round-trip). Fewer flushes = less back-pressure on workers.
+- **WARC parser startup**: removed slow `COUNT(*) WHERE parsed = FALSE` query (scans 26M-row index, takes minutes). Not needed — fetcher discovers empty results naturally.
+- **WARC parser channel buffers**: hostCh 5K → 20K, resultCh 500 → 1K. Prevents micro-stalls between DB fetcher queries.
+- **Icon download concurrency: 1000 → 5000** (16GB RAM supports the connection overhead).
+- **Icon download channel buffer**: 5K → 20K.
+- **Bundle gen concurrency: 20 → 40** (8 cores × 5).
+- **Bundle gen channel buffers**: 1.2K → 6K.
+- **Debug logging** added to WARC parser fetcher and writer to diagnose stalls.
+
+### Run parameters
 - `--limit 0` for CC-Index query
-- **Expected (extrapolated from 300K run):**
-  - CC-Index: ~10min (download) + ~15min (query) — possibly much longer at 30M due to swap thrashing
-  - WARC parsing: ~14-15hrs (566 hosts/sec)
-  - Icon download: ~50-60hrs (439 icons/sec at 1000 concurrency, the long pole — 2500 concurrency may improve this)
-  - Bundle gen: ~3.5hrs (2,377 hosts/sec)
-  - **Total: ~3 days**
+- **CC-Index result:** 26,703,146 hosts (20.6M https, 6.1M http)
 - Run in tmux, monitor with `psql` queries from another session
-- **Expected disk:** ~650GB for all icons (6.5GB per 300K × 100)
-- **Cost:** ~$50 (EC2 + RDS + 1TB EBS for 3-4 days)
+- **Expected disk:** ~650GB for all icons
+- **Expected time:** ~2 days (with upgraded infra, down from the earlier ~3-day estimate)
+- **Cost:** ~$65 (EC2 c5.2xlarge + RDS m5.large + 1TB EBS for 2-3 days)
 - After completion: deploy frontend, verify live site, backup icons + DB to homelab via rsync
 - **Stuck icon recovery** (if icon download crashes): `UPDATE icons SET scan_state = 'unscanned' WHERE scan_state = 'in_progress';`

-### Consider c5.2xlarge for future runs
-The CC-Index DuckDB query is memory-bound — at 30M the GROUP BY hash table exceeds 8GB and swap thrashing dominates query time. c5.2xlarge (16GB, 8 vCPUs) would eliminate swap pressure entirely and double CPU cores for bundle gen. Cost difference: $0.17/hr → $0.34/hr, but if it halves the CC-Index query time and speeds up bundle gen (CPU-bound), the total EC2 hours may decrease enough to break even. Also benefits WARC parsing (more headroom for 500+ goroutines) and icon download (more memory for 5000 concurrent connections). Worth testing on a future run.
+### Lessons learned (during 30M run)
+- **Burstable DB instances are unsuitable for pipeline workloads.** The t3.medium throttled under sustained writes, causing stalls that propagated through the entire WARC parser pipeline via channel back-pressure. Non-burstable m5 instances provide consistent performance.
+- **WARC parsing is CPU-bound, not just I/O-bound.** At 500 goroutines on 4 cores, CPU was at 100%: TLS handshakes + gzip decompression + HTML parsing add up. More cores directly increase throughput.
+- **Channel buffer sizing matters.** Small buffers (5K) caused micro-stalls every time the DB fetcher ran a query. 20K buffers give the fetcher enough runway to query without starving workers.
+- **DuckDB temp_directory is critical at scale.** Without it, DuckDB spills to tmpfs (RAM-backed), which then swaps to disk via the OS — double indirection. Pointing temp_directory at EBS lets DuckDB manage spill efficiently with large sequential I/O.
+- **COUNT(*) on large partial indexes is expensive.** The startup query `SELECT COUNT(*) FROM hosts WHERE parsed = FALSE` on 26M rows took minutes. Unnecessary — just start processing and discover completion naturally.

 ## Phase 9: Frontend Polish