diff --git a/PLAN.md b/PLAN.md
index a342b88..3fdd40b 100644
--- a/PLAN.md
+++ b/PLAN.md
@@ -494,16 +494,31 @@ On completion, each program prints a summary line and writes its stats JSON (wit

 ---

-## Phase 7: 3M Scale Test
+## Phase 7: 3M Scale Test [COMPLETED]

-Validates disk-based icon storage at scale and gets real timing estimates.
+Validated disk-based icon storage, performance tuning, and full pipeline at 3M scale.

-- Tear down current infra, bring up fresh (1TB EBS)
-- Run full pipeline with `--limit 3000000`
-- Icon downloader writes to local disk (`--icons-dir icons`) instead of S3
-- Downloads ALL icons (no size filter) — full archive for posterity
-- Bundle gen reads from local disk (`--icons-dir icons`)
-- **Watch for:** disk I/O, DuckDB LIMIT behavior, any OOM issues
+**Final 3M pipeline results:**
+
+| Stage | Duration | Result |
+|-------|----------|--------|
+| CC-Index query | ~13min | 3M hosts |
+| WARC parsing | ~3hrs (concurrency 50) | 2.8M titles, 6M icons |
+| Icon download | 4h21m (408/s) | 4.5M completed, 53GB, 70% success |
+| Best icon selection | instant | 2M hosts with icons |
+| Bundle generation | 1h23m (540 hosts/sec) | 22,429 bundles, 4.7GB |
+| Frontend deploy | seconds | Live at everytab.site |
+| **Total** | **~9 hours** | |
+
+**Key changes during this phase:**
+- Icons stored on local disk (sharded `ab/cd/ef/hash`), not S3 — saves ~$175 in PUT costs
+- Removed icon size filter — downloads ALL icons for archival, filters at bundle gen time
+- Dropped `ORDER BY md5(id::text)` from icon claim query — was causing 30-second burst/stall cycles at 3M scale
+- Icon download batch size 200 → 5000, channel buffer = batch size
+- Bundle gen rewritten to stream: paginated DB reads, incremental bundle writes (fixed OOM at 3M)
+- `random_order` column on hosts table for shuffled bundles
+- EBS volume sized at 1TB for full icon archive
+- Added COSTS.md with monthly cost breakdown (~$42/month ongoing)

 ### Icon selection strategy (TODO: decide before Phase 8)

@@ -518,66 +533,98 @@ This works but questions remain:
 - Should we have different strategies for different icon sources? (e.g., always use link_rel PNG 32x32 if available, fall back to favicon.ico)
 - At 30M scale, how much do large icons bloat total bundle size? Need data from the 3M run to decide.

-## Phase 7.2: Performance Fixes + 3M Re-test
+## Phase 7.2: Code Review + Performance Fixes [COMPLETED]

-Code review and performance improvements before the full 30M run. Make changes, review, then re-run the full 3M pipeline to validate.
+Adversarial code review followed by performance improvements, validated with a 300K run.

-### Bundle gen redesign: streaming pipeline
+### Code review findings and fixes
+- **float32 pagination bug** — `random_order REAL` on hosts table caused ~1.5% data loss at 30M scale due to float32 collisions in keyset pagination. Fixed: `DOUBLE PRECISION` + `float64` in Go.
+- **Protocol missing from bundles** — bundle JSON had `host` but no protocol. HTTP-only sites (23%) loaded as `https://` and broke. Fixed: replaced `host` field with `url` (full URL built on Go side).
+- **Non-atomic bundle deployment** — bundle gen deleted all S3 bundles before writing new ones. Crash mid-write = broken live site. Fixed: overwrite in-place, deploy.sh cleans up stale bundles after cache invalidation.
+- **ARCHITECTURE.md stale** — still described S3 icon storage, old claim query, old bundle format. Updated throughout to match current code.
+- **Dead go.mod dependencies** — progressbar and transitive deps removed. Direct vs indirect annotations fixed via `go mod tidy`.
+- **Shadowed builtins** — custom `min()`/`max()` functions removed (Go 1.21+ builtins).
+- **BMP decoder missing** — standalone BMP favicons passed download but failed in bundle gen. Added `golang.org/x/image/bmp` import.
+- **Frontend memory leak** — `loadedIcons` array grew unboundedly. Capped at 100 entries.
+- **Iframe stats inflated** — error hosts counted as "iframe blocked" (zero value of bool). Fixed to only count successful parses.
+- **CSP check incomplete** — only checked first `Content-Security-Policy` header. Fixed to check all headers via `headers.Values()`.
+- **DNS error classification** — direct type assertion `err.(*net.DNSError)` never matched wrapped errors. Fixed with `errors.As()`.
+- **Icon download host hammering** — adjacent same-host icons in batches caused simultaneous requests. Fixed: Fisher-Yates shuffle of each batch before feeding to workers.
+- **Max icons per host** — capped at 50 link_rel icons per host in HTML parser to prevent adversarial pages from bloating the DB.
+- **`downloaded_at` column** — added to icons table for data freshness tracking.

-Current architecture (batch-convert-then-write):
+### Pipeline performance redesign
+
+**WARC parser** — three-stage pipeline:
 ```
-[fetch 6000 hosts] → [convert all 6000 icons] → [write bundles] → [fetch next 6000]
-                     ^^^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^
-                          all cores busy          cores idle
+[DB fetcher] → hostCh → [500 workers] → resultCh → [DB writer with pgx.Batch]
 ```
+- Channel-based worker pool (500 goroutines, up from 100 semaphore)
+- S3 retry with 6 attempts (AWS SDK `retry.AddWithMaxAttempts`)
+- Batched DB writes via `pgx.Batch` (100 results = ~400 queries per round-trip)
+- Result: 566 hosts/sec (1.6x improvement over 352/sec)

-Target architecture (fully pipelined):
+**Bundle gen** — four-stage pipeline:
 ```
-[DB fetcher] → channel → [N converter workers] → channel → [bundle writer]
-   always                     always busy                  writes as soon as
- prefetching                                               120 entries ready
+[DB fetcher] → hostCh → [20 converters] → entryCh → [assembler] → uploadCh → [10 uploaders]
 ```
+- Converters default 20 (CPU-bound, ~5x core count on c5.xlarge)
+- Separate upload workers for S3 PUT parallelism
+- Result: 2,377 hosts/sec (4.4x improvement over 540/sec)

-Changes:
-- **DB fetcher goroutine** — continuously fetches pages, feeds host channel. Same pattern as icon downloader.
-- **Converter workers** (200 goroutines) — read from host channel, read icon from disk, decode, re-encode PNG, base64, send BundleEntry to output channel.
-- **Bundle writer goroutine** — collects entries from output channel into a buffer. Every 120 entries, serialize JSON and upload to S3. Runs concurrently with conversion.
+**Icon download** — in-memory batch shuffle added, concurrency bumped to 1000.

-This eliminates all synchronization points — DB, disk, CPU, and S3 are all utilized simultaneously. No fits and starts.
+**CC-Index query** — downloads parquet files locally first (`aws s3 sync`), then queries with DuckDB. Eliminates S3 503 rate-limit failures.

-### Other fixes to include
-- **WARC parser: add S3 retry with backoff** — currently concurrency 50 to avoid 503s. With retry, can go back to 100+ and handle transient 503s gracefully.
-- **Icon download: reduce timeout to 3-5s** — most legitimate servers respond in <1s. Dead hosts block a worker for 10s currently.
-- **Icon download: confirm all icons downloaded** — size filter already removed from claim query, but verify at 3M that all link_rel icons (including large declared sizes) are being downloaded.
-- **Icon selection strategy** — decide on final criteria (see Phase 7 notes above) and validate with 3M data.
+**Best icon selection** — new priority: target 32x32 for Retina display. Pick smallest icon ≥32px, fall back to largest <32px. No more "standard sizes" tiers.

-### Code review
-Before running the 3M re-test, do a full read-through of all Go code:
-- Check all error handling paths — are errors logged, counted, and surfaced?
-- Check concurrency patterns — any race conditions, deadlocks, goroutine leaks?
-- Check resource cleanup — are DB connections, file handles, HTTP responses closed?
-- Check the new streaming bundle gen — does it handle edge cases (empty pages, partial final bundle, S3 upload failures)?
-- Review all CLI flag defaults — are they tuned for the 30M run?
+### Stats improvements
+- WARC parser: added `no_title` counter, fixed `icons_found` to include favicon.ico
+- Best icon selection: now writes `stats/04_best_icon.json`
+- Bundle gen: added `bundled_with_icon` / `bundled_no_icon` counters (distinguishes "never had icon" from "convert error")

-### Validation
-- Run full 3M pipeline from scratch with all fixes
-- Compare timings against Phase 7 run
-- Confirm bundle gen saturates CPU and doesn't stall
-- Confirm icon download rate improves with shorter timeout
-- Confirm WARC parsing can run at higher concurrency with retry
-- Confirm all icon types (including large) are downloaded
-- Review live site at everytab.site with 3M data
+### 300K validation run results
+
+| Stage | Duration | Rate |
+|-------|----------|------|
+| CC-Index query | 83s | — |
+| WARC parsing | 8m50s | 566 hosts/sec |
+| Icon download | 34m45s | 439 icons/sec |
+| Best icon selection | instant | — |
+| Bundle generation | 1m59s | 2,377 hosts/sec |
+| Frontend deploy | seconds | — |
+| **Total** | **~47 min** | |
+
+**Loss funnel:**
+```
+300,000 hosts from CC-Index
+  → 282,854 with titles (94.3%)
+    → 213,656 bundled with icon (75.5% of titled)
+    → 69,198 bundled without icon (24.5%)
+      → 68,793 never had an icon
+      → 405 icon convert errors
+  → 2,358 bundles, 603MB total
+```

 ## Phase 8: 30M Full Run (Single Machine)

 Full internet scan on one c5.xlarge.

 - `--limit 0` for CC-Index query
-- **Expected:** ~3-4 days total (WARC parsing ~25hrs, icon download ~50hrs)
+- **Expected (extrapolated from 300K run):**
+  - CC-Index: ~10min (download) + ~15min (query) — possibly much longer at 30M due to swap thrashing
+  - WARC parsing: ~14-15hrs (566 hosts/sec)
+  - Icon download: ~50-60hrs (439 icons/sec at 1000 concurrency, the long pole — 2500 concurrency may improve this)
+  - Bundle gen: ~3.5hrs (2,377 hosts/sec)
+  - **Total: ~3 days**
 - Run in tmux, monitor with `psql` queries from another session
-- **Expected disk:** ~500GB-1TB for all icons (full archive)
-- **Cost:** ~$50 (EC2 + RDS + 1TB EBS for 4 days)
+- **Expected disk:** ~650GB for all icons (6.5GB per 300K × 100)
+- **Cost:** ~$50 (EC2 + RDS + 1TB EBS for 3-4 days)
 - After completion: deploy frontend, verify live site, backup icons + DB to homelab via rsync
+- **Stuck icon recovery** (if icon download crashes): `UPDATE icons SET scan_state = 'unscanned' WHERE scan_state = 'in_progress';`
+
+### Consider c5.2xlarge for future runs
+The CC-Index DuckDB query is memory-bound — at 30M the GROUP BY hash table exceeds 8GB and swap thrashing dominates query time. c5.2xlarge (16GB, 8 vCPUs) would eliminate swap pressure entirely and double CPU cores for bundle gen. Cost difference: $0.17/hr → $0.34/hr, but if it halves the CC-Index query time and speeds up bundle gen (CPU-bound), the total EC2 hours may decrease enough to break even. Also benefits WARC parsing (more headroom for 500+ goroutines) and icon download (more memory for 5000 concurrent connections). Worth testing on a future run.

 ## Phase 9: Frontend Polish

@@ -621,17 +668,18 @@ Monthly pipeline triggered by new Common Crawl release.
 ## Future Improvements (Non-Blocking)

 ### Pipeline
-- **WARC parser: retry on fetch errors** — add 1 retry with backoff for transient S3 errors
-- **WARC parser: batch DB inserts** — pgx batch or CopyFrom for better write throughput
+- **CC-Index query: streaming dedup** — current GROUP BY builds a ~30M-row hash table in memory, causing severe swap thrashing on c5.xlarge (8GB). Options: (1) use `INSERT ... ON CONFLICT (hostname) DO UPDATE` to stream rows into Postgres and let the UNIQUE constraint dedup, eliminating the hash table entirely; (2) process parquet files in smaller batches, dedup per-batch in DuckDB, final dedup in Postgres; (3) just use c5.2xlarge (16GB) to fit the hash table in RAM. Current workaround: `SET temp_directory` to let DuckDB spill to EBS instead of OS swap.
 - **Encoding: remaining garbled titles** — more aggressive charset detection heuristics
 - **Icon download: retry transient failures** — single retry for DNS/timeout errors
-- **Bundle gen: SVG rasterization** — recover ~1,077 hosts with SVG-only favicons
+- **Bundle gen: SVG rasterization** — shell out to `rsvg-convert` for SVG-only hosts (~3.5% of icons)
 - **Bundle gen: bilinear downscaling** — better quality than nearest-neighbor for >128px icons

 ### Frontend
-- **Cross-browser tab styling** — match real browser tabs more closely
 - **Mobile layout** — responsive tab sizing, touch-friendly interaction
-- **Stats page** — pipeline stats rendered on the site
+- **Stats page / Sankey diagram** — pipeline loss funnel rendered on the site

-### Icon Download Ordering
-- **Verify icon row ordering at 30M scale** — WARC parser inserts ~2-5 icons per host in sequence (favicon_ico + link_rel entries). Without ORDER BY, Postgres returns rows roughly in insertion order, so icons from the same host are adjacent. At 100K-3M this didn't cause problems (batch size 5000 means each batch has icons from ~2,000+ different hosts). At 30M, confirm with `iftop` that we're not hammering individual hosts. If needed, add `random_order REAL DEFAULT random()` column to icons table and use it in the claim query — but don't index it (60M+ writes).
+### Additional Metadata
+- **`http_status INT` on icons** — structured HTTP status code (currently only stored as error text). Enables analysis like "404 (site moved) vs 403 (bot blocked) vs 500 (server error)".
+- **`response_time_ms INT` on icons** — server response latency. Useful for tuning timeouts, identifying slow hosts, health signal.
+- **`parsed_at TIMESTAMPTZ` on hosts** — when the WARC was parsed. Currently only a `parsed` boolean.
+- **`created_at TIMESTAMPTZ DEFAULT now()` on hosts** — when the host entered the pipeline.
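
Reviewer note on the "Best icon selection" rule added in Phase 7.2 (pick the smallest icon ≥32px, fall back to the largest <32px): the rule can be sketched in Go roughly as follows. This is a minimal illustration, not the project's actual code. The `Icon` type, its fields, the `targetSize` constant, and the `pickBestIcon` function are all hypothetical names, and the sketch assumes square icons described by a single pixel size.

```go
package main

import "fmt"

// Icon is a hypothetical, simplified view of one candidate icon for a host.
// The real icons table almost certainly carries more columns.
type Icon struct {
	URL  string
	Size int // edge length in pixels, assuming a square icon
}

// targetSize is the 32px target named in the plan.
const targetSize = 32

// pickBestIcon returns the smallest icon whose size is at least targetSize.
// If no icon reaches targetSize, it returns the largest icon below it.
// The boolean result is false only when icons is empty.
func pickBestIcon(icons []Icon) (Icon, bool) {
	var atLeast, below Icon
	haveAtLeast, haveBelow := false, false
	for _, ic := range icons {
		if ic.Size >= targetSize {
			// Track the smallest icon that still meets the target.
			if !haveAtLeast || ic.Size < atLeast.Size {
				atLeast, haveAtLeast = ic, true
			}
		} else if !haveBelow || ic.Size > below.Size {
			// Track the largest icon that falls short of the target.
			below, haveBelow = ic, true
		}
	}
	if haveAtLeast {
		return atLeast, true
	}
	return below, haveBelow
}

func main() {
	icons := []Icon{{"a.png", 16}, {"b.png", 48}, {"c.png", 192}}
	best, ok := pickBestIcon(icons)
	fmt.Println(best.URL, best.Size, ok) // prints "b.png 48 true"
}
```

A single pass keeps the selection O(n) per host with no sorting, which matters little at 50 icons per host but keeps the rule easy to audit against the two-branch description in the plan.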