updated PLAN.md and ARCHITECTURE.md with new instance type and performance concerns
parent b419b5bf6c
commit baf657a8ed
2 changed files with 98 additions and 25 deletions
42
PLAN.md
@ -606,25 +606,43 @@ Adversarial code review followed by performance improvements, validated with a 3
→ 2,358 bundles, 603MB total
```

## Phase 8: 30M Full Run (Single Machine)
## Phase 8: 30M Full Run [IN PROGRESS]

Full internet scan on one c5.xlarge.
Full internet scan. Upgraded to c5.2xlarge + db.m5.large after 300K run revealed bottlenecks.

### Infrastructure changes (mid-run)
- **EC2: c5.xlarge → c5.2xlarge** (8 vCPUs, 16GB RAM). WARC parsing was CPU-bound at 100% on 4 cores, icon download was memory-limited at 2500 concurrent connections, and bundle gen was CPU-bound. The extra cores and RAM benefit all three stages.
- **RDS: db.t3.medium → db.m5.large** (non-burstable, 2 vCPUs, 8GB RAM). The burstable t3 instance was being CPU-throttled under sustained write load, which stalled the WARC parser's DB writer and back-pressured the workers.
- **Swap: 4GB → 8GB** (doubled in step with physical RAM, as a safety margin).
- **DuckDB temp_directory** set to EBS (`~/duckdb_temp`) instead of defaulting to tmpfs. DuckDB's managed spill-to-disk is far more efficient than OS swap: large sequential reads instead of random 4KB page faults.

### Tuning changes
- **WARC parser concurrency: 500 → 500** (unchanged, but now on 8 cores instead of 4, so real throughput is higher).
- **WARC parser write batch: 100 → 500** (~2000 queries per DB round-trip). Fewer flushes = less back-pressure on workers.
- **WARC parser startup**: removed slow `COUNT(*) WHERE parsed = FALSE` query (scans 26M-row index, takes minutes). Not needed, since the fetcher discovers empty results naturally.
- **WARC parser channel buffers**: hostCh 5K → 20K, resultCh 500 → 1K. Prevents micro-stalls between DB fetcher queries.
- **Icon download concurrency: 1000 → 5000** (16GB RAM supports the connection overhead).
- **Icon download channel buffer**: 5K → 20K.
- **Bundle gen concurrency: 20 → 40** (8 cores × 5).
- **Bundle gen channel buffers**: 1.2K → 6K.
- **Debug logging** added to WARC parser fetcher and writer to diagnose stalls.
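
The write-batch change in the list above can be illustrated with a minimal sketch of a batching writer. It assumes results arrive on a channel and are flushed in groups; the `result` type, `batchWriter`, and the `flush` callback are hypothetical names, not taken from the actual parser.

```go
package main

import "fmt"

// result is a hypothetical stand-in for one parsed host's output.
type result struct {
	Host string
}

// batchWriter drains results and calls flush once per batchSize items,
// plus one final flush for any remainder after the channel closes.
// It returns the number of flushes performed.
func batchWriter(results <-chan result, batchSize int, flush func([]result)) int {
	batch := make([]result, 0, batchSize)
	flushes := 0
	for r := range results {
		batch = append(batch, r)
		if len(batch) == batchSize {
			flush(batch)
			flushes++
			batch = batch[:0] // reuse the backing array
		}
	}
	if len(batch) > 0 {
		flush(batch)
		flushes++
	}
	return flushes
}

func main() {
	const total = 1200
	results := make(chan result, 1000) // resultCh buffer size from the list above
	go func() {
		defer close(results)
		for i := 0; i < total; i++ {
			results <- result{Host: fmt.Sprintf("host-%d.example", i)}
		}
	}()

	written := 0
	flushes := batchWriter(results, 500, func(b []result) {
		written += len(b) // a real flush would issue one multi-row DB write here
	})
	fmt.Println(written, flushes)
}
```

With a batch size of 500, the 1200 results in this sketch are written in three flushes (500, 500, 200) rather than twelve at a batch size of 100, which is the reduction in round-trips the tuning change is after.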

### Run parameters
- `--limit 0` for CC-Index query
- **Expected (extrapolated from 300K run):**
  - CC-Index: ~10min (download) + ~15min (query), possibly much longer at 30M due to swap thrashing
  - WARC parsing: ~14-15hrs (566 hosts/sec)
  - Icon download: ~50-60hrs (439 icons/sec at 1000 concurrency, the long pole; 2500 concurrency may improve this)
  - Bundle gen: ~3.5hrs (2,377 hosts/sec)
  - **Total: ~3 days**
- **CC-Index result:** 26,703,146 hosts (20.6M https, 6.1M http)
- Run in tmux, monitor with `psql` queries from another session
- **Expected disk:** ~650GB for all icons (6.5GB per 300K × 100)
- **Cost:** ~$50 (EC2 + RDS + 1TB EBS for 3-4 days)
- **Expected disk:** ~650GB for all icons
- **Expected time:** ~2 days (with upgraded infra, down from ~3 days estimate)
- **Cost:** ~$65 (EC2 c5.2xlarge + RDS m5.large + 1TB EBS for 2-3 days)
- After completion: deploy frontend, verify live site, backup icons + DB to homelab via rsync
- **Stuck icon recovery** (if icon download crashes): `UPDATE icons SET scan_state = 'unscanned' WHERE scan_state = 'in_progress';`
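
The extrapolated figures above follow from dividing the host count by the throughput measured on the 300K run. A small sketch of that arithmetic, assuming a round 30M hosts and a steady rate (`etaHours` is a hypothetical helper, not part of the pipeline):

```go
package main

import "fmt"

// etaHours estimates wall-clock hours to process items at a steady rate.
func etaHours(items, perSecond float64) float64 {
	return items / perSecond / 3600
}

func main() {
	const hosts = 30e6 // assumed round target; the CC-Index result was 26.7M

	fmt.Printf("WARC parsing: %.1f h\n", etaHours(hosts, 566))
	fmt.Printf("Bundle gen:   %.1f h\n", etaHours(hosts, 2377))

	// Disk: 6.5GB per 300K hosts, scaled linearly.
	fmt.Printf("Icon disk:    %.0f GB\n", 6.5*hosts/300000)
}
```

These come out at roughly 14.7 hours, 3.5 hours, and 650GB, matching the estimates in the list. The rates are measured per stage, so the sketch says nothing about how stages overlap.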

### Consider c5.2xlarge for future runs
The CC-Index DuckDB query is memory-bound: at 30M the GROUP BY hash table exceeds 8GB and swap thrashing dominates query time. c5.2xlarge (16GB, 8 vCPUs) would eliminate swap pressure entirely and double the CPU cores for bundle gen. Cost difference: $0.17/hr → $0.34/hr, but if it halves the CC-Index query time and speeds up bundle gen (CPU-bound), the total EC2 hours may decrease enough to break even. It also benefits WARC parsing (more headroom for 500+ goroutines) and icon download (more memory for 5000 concurrent connections). Worth testing on a future run.
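
The break-even argument above can be made concrete with a short sketch. The hourly rates are the ones quoted in the paragraph; the 72-hour baseline is an assumption standing in for a ~3-day run, and both function names are hypothetical.

```go
package main

import "fmt"

// runCost returns the EC2 cost in dollars for a run of the given length.
func runCost(hourlyRate, hours float64) float64 {
	return hourlyRate * hours
}

// breakEvenHours returns how long the larger instance may run before it
// costs more than the smaller instance running for baselineHours.
func breakEvenHours(smallRate, largeRate, baselineHours float64) float64 {
	return baselineHours * smallRate / largeRate
}

func main() {
	const small, large = 0.17, 0.34 // $/hr figures quoted above
	baseline := 72.0                // assumed: ~3 days on the smaller instance

	fmt.Printf("baseline cost: $%.2f\n", runCost(small, baseline))
	fmt.Printf("break-even:    %.0f h on the larger instance\n",
		breakEvenHours(small, large, baseline))
}
```

Because the rate doubles, the larger instance only breaks even on EC2 cost if the whole run finishes in half the time; RDS and EBS charges, which accrue per hour regardless of instance size, would tilt the comparison further toward the shorter run.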
### Lessons learned (during 30M run)
- **Burstable DB instances are unsuitable for pipeline workloads.** The t3.medium throttled under sustained writes, causing stalls that propagated through the entire WARC parser pipeline via channel back-pressure. Non-burstable m5 instances provide consistent performance.
- **WARC parsing is CPU-bound, not just I/O-bound.** At 500 goroutines on 4 cores, CPU was at 100%: TLS handshakes, gzip decompression, and HTML parsing add up. More cores directly increase throughput.
- **Channel buffer sizing matters.** Small buffers (5K) caused micro-stalls every time the DB fetcher ran a query. 20K buffers give the fetcher enough runway to query without starving workers.
- **DuckDB temp_directory is critical at scale.** Without it, DuckDB spills to tmpfs (RAM-backed), which then swaps to disk via the OS, a double indirection. Pointing temp_directory at EBS lets DuckDB manage spill efficiently with large sequential I/O.
- **COUNT(*) on large partial indexes is expensive.** The startup query `SELECT COUNT(*) FROM hosts WHERE parsed = FALSE` on 26M rows took minutes. It is unnecessary: just start processing and discover completion naturally.

## Phase 9: Frontend Polish
