improved write efficiency, though we are still bottlenecked on RDS - will switch to local Postgres for future runs

2026-05-20 22:38:23 -04:00 · 2026-05-20 22:38:23 -04:00 · 4fa40c7b47
commit 4fa40c7b47
parent baf657a8ed
3 changed files with 98 additions and 77 deletions
--- a/PLAN.md
+++ b/PLAN.md
@@ -643,6 +643,13 @@ Full internet scan. Upgraded to c5.2xlarge + db.m5.large after 300K run revealed
 - **Channel buffer sizing matters.** Small buffers (5K) caused micro-stalls every time the DB fetcher ran a query. 20K buffers give the fetcher enough runway to query without starving workers.
 - **DuckDB temp_directory is critical at scale.** Without it, DuckDB spills to tmpfs (RAM-backed), which then swaps to disk via the OS — double indirection. Pointing temp_directory at EBS lets DuckDB manage spill efficiently with large sequential I/O.
 - **COUNT(*) on large partial indexes is expensive.** The startup query `SELECT COUNT(*) FROM hosts WHERE parsed = FALSE` on 26M rows took minutes. Unnecessary — just start processing and discover completion naturally.
+- **Autovacuum competes with heavy writes for disk I/O.** Millions of UPDATEs create dead row versions. Autovacuum kicks in to clean them, saturating disk I/O and stalling writers. Fix: disable autovacuum on hosts/icons during the pipeline run, re-enable + manual `VACUUM ANALYZE` at the end. Now automated in the WARC parser code.
+- **RDS storage must be sized for the full run.** 20GB (set during 100K dev) filled up at 16GB during the 30M run. Icons table at 80M rows with indexes needs ~25-30GB. Default bumped to 50GB. RDS storage can only be increased, never decreased.
+- **Multiple DB writer goroutines help throughput.** Single writer couldn't drain resultCh fast enough — 8 writers with independent buffers keep up to 8 batches in flight to RDS simultaneously.
+- **RDS storage optimization causes temporary I/O degradation.** After expanding storage, RDS runs background optimization that competes with writes. Can last up to an hour. Plan storage right from the start to avoid mid-run resizes.
+- **gp3 IOPS baseline (3000) is a hard limit at scale.** 8 DB writers with batch size 500 exhausted EBS I/O burst credits (EBSIOBalance% hit 0%), causing 38ms write latency (normal <5ms) and pipeline stalls. Fix: reduce to 3 writers with batch size 1000 — fewer, larger flushes stay under 3000 IOPS. Custom provisioned IOPS on gp3 requires 400GB+ storage (not worth it for a temp DB).
+- **Consider running Postgres locally on EC2 for future runs.** RDS gp3 IOPS (3000 baseline) is the main bottleneck for WARC parsing writes. Running Postgres directly on the EC2 instance's 1TB EBS volume eliminates the network hop to RDS and the separate IOPS budget. Also saves the RDS cost ($12-15/run). Tradeoff: must install and configure Postgres yourself (or add to ec2-userdata.sh).
+- **Reconsider c5.xlarge for future runs.** Upgraded to c5.2xlarge assuming CPU was the bottleneck, but RDS IOPS turned out to be the real constraint for WARC parsing, and icon download is internet-bound. If the extra cores don't meaningfully improve throughput (check CPU utilization during the full run), c5.xlarge at half the cost ($0.17/hr vs $0.34/hr) may be sufficient. The only stage that clearly benefits from 8 cores is bundle gen (~2hrs saved).
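
The autovacuum note above (disable on hosts/icons for the run, then re-enable and run `VACUUM ANALYZE`) could be sketched roughly as follows. This is not the WARC parser's actual code: `beforeRun` and `afterRun` are hypothetical helpers that only build the SQL strings, and the table names are assumed to be trusted constants, since they are interpolated without quoting. Executing the statements (for example via `database/sql`) is left out; note that `VACUUM` cannot run inside a transaction block.

```go
package main

import "fmt"

// beforeRun returns one statement per table that switches off autovacuum
// through the per-table storage parameter autovacuum_enabled.
// Hypothetical helper; table names must be trusted constants.
func beforeRun(tables []string) []string {
	stmts := make([]string, 0, len(tables))
	for _, t := range tables {
		stmts = append(stmts, fmt.Sprintf("ALTER TABLE %s SET (autovacuum_enabled = false)", t))
	}
	return stmts
}

// afterRun returns, for each table, a statement that re-enables autovacuum
// followed by a manual VACUUM ANALYZE to clean up dead row versions and
// refresh planner statistics.
func afterRun(tables []string) []string {
	stmts := make([]string, 0, 2*len(tables))
	for _, t := range tables {
		stmts = append(stmts,
			fmt.Sprintf("ALTER TABLE %s SET (autovacuum_enabled = true)", t),
			fmt.Sprintf("VACUUM ANALYZE %s", t),
		)
	}
	return stmts
}

func main() {
	tables := []string{"hosts", "icons"}
	for _, s := range beforeRun(tables) {
		fmt.Println(s)
	}
	// ... heavy UPDATE workload would run here ...
	for _, s := range afterRun(tables) {
		fmt.Println(s)
	}
}
```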

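A minimal sketch of the multi-writer pattern described in the notes above: several goroutines drain one result channel, each with its own batch buffer, and flush when the buffer is full. The channel name `resultCh` and the 3-writer / batch-1000 / 20K-buffer settings come from the notes; the `Result` type, `runWriters`, and the `flush` callback are hypothetical stand-ins for the real row type and the real batched write to Postgres.

```go
package main

import (
	"fmt"
	"sync"
)

// Result is a hypothetical stand-in for one parsed row.
type Result struct {
	HostID int
}

// runWriters starts n writer goroutines that drain resultCh. Each writer
// keeps an independent buffer and calls flush once batchSize rows have
// accumulated, plus one final flush for any remainder after resultCh is
// closed. flush must not retain the slice, because the buffer is reused.
// It returns one of the flush errors, or nil if every flush succeeded.
func runWriters(resultCh <-chan Result, n, batchSize int, flush func([]Result) error) error {
	var wg sync.WaitGroup
	errCh := make(chan error, n) // each writer reports at most one error

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]Result, 0, batchSize)
			failed := false
			for r := range resultCh {
				if failed {
					continue // keep draining so producers never block
				}
				buf = append(buf, r)
				if len(buf) == batchSize {
					if err := flush(buf); err != nil {
						errCh <- err
						failed = true
					}
					buf = buf[:0]
				}
			}
			if !failed && len(buf) > 0 {
				if err := flush(buf); err != nil {
					errCh <- err
				}
			}
		}()
	}

	wg.Wait()
	close(errCh)
	return <-errCh // nil when no writer reported an error
}

func main() {
	resultCh := make(chan Result, 20000)
	go func() {
		for i := 0; i < 2500; i++ {
			resultCh <- Result{HostID: i}
		}
		close(resultCh)
	}()

	var mu sync.Mutex
	total := 0
	err := runWriters(resultCh, 3, 1000, func(batch []Result) error {
		mu.Lock()
		defer mu.Unlock()
		total += len(batch) // a real flush would issue one batched write here
		return nil
	})
	fmt.Println("rows flushed:", total, "err:", err)
}
```

The writer count and batch size are parameters, so trading 8 writers at batch size 500 for 3 writers at batch size 1000 is a one-line change at the call site.
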
 ## Phase 9: Frontend Polish