From 2f4e5b585dece56511af62a17bb6386537603dd6 Mon Sep 17 00:00:00 2001
From: Joe Lothan
Date: Tue, 19 May 2026 08:34:42 -0400
Subject: [PATCH] updated PLAN.md for future plans

---
 PLAN.md | 86 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 71 insertions(+), 15 deletions(-)

diff --git a/PLAN.md b/PLAN.md
index 14c7374..792cd88 100644
--- a/PLAN.md
+++ b/PLAN.md
@@ -561,22 +561,78 @@ On completion, each program prints a summary line and writes its stats JSON (wit
 
 ---
 
-## Future Improvements
+## Phase 7: 3M Scale Test
+
+Validates disk-based icon storage at scale and gets real timing estimates.
+
+- Tear down current infra, bring up fresh (300GB EBS)
+- Run full pipeline with `--limit 3000000`
+- Icon downloader writes to local disk (`--icons-dir icons`) instead of S3
+- Bundle gen reads from local disk (`--icons-dir icons`)
+- **Expected:** ~8hrs total, ~30GB disk for icons
+- **Watch for:** DuckDB LIMIT behavior, disk I/O during icon download, any OOM issues
+
+## Phase 8: 30M Full Run (Single Machine)
+
+Full internet scan on one c5.xlarge.
+
+- `--limit 0` for CC-Index query
+- **Expected:** ~3-4 days total (WARC parsing ~25hrs, icon download ~50hrs)
+- Run in tmux, monitor with `psql` queries from another session
+- **Expected disk:** ~200-300GB for icons
+- **Cost:** ~$44 (EC2 + RDS + EBS for 4 days)
+- After completion: deploy frontend, verify live site, backup icons to homelab via rsync
+
+## Phase 9: Frontend Polish
+
+Before public launch:
+- Cross-browser tab styling — use actual browser screenshots as reference for Chrome/Firefox/Safari
+- Mobile responsive layout
+- Performance — IntersectionObserver to pause off-screen marquee rows, reduce DOM count
+- Stats page — render pipeline stats (host count, icon coverage, crawl date)
+- Test across browsers and devices
+
+## Phase 10: Parallelization (if needed)
+
+Only pursue if single-machine time (~3-4 days) is unacceptable.
+
+**Approach:** N machines for WARC parsing + icon download, consolidate, bundle on one machine.
+- WARC parsing: already supports fleet via `FOR UPDATE SKIP LOCKED` — run same binary on N machines pointing at same RDS
+- Icon download: same `SKIP LOCKED` pattern — each machine downloads to local disk
+- Consolidation: rsync icons from N machines to one machine
+- Bundle gen: runs on the consolidation machine
+- **Alternative:** partition by `id % N` so each machine owns a shard end-to-end (WARC → icons → bundles), no consolidation needed. Bundle numbering uses non-overlapping ranges.
+- **Infrastructure:** Terraform variable `ec2_count = N`, each instance gets same IAM/security group
+- **Expected speedup:** ~linear (4 machines ≈ 4x faster ≈ 1 day for full scan)
+
+**Simpler alternative to parallelization:**
+- Download only `/favicon.ico` (skip link_rel icons) — cuts icon count from ~67M to ~30M, roughly halving the longest stage. Minimal quality loss since most best-selected icons are favicon.ico anyway.
+- Use `c5n.xlarge` (25 Gbps NIC) instead of `c5.xlarge` (10 Gbps) — check with `iftop` if network is actually the bottleneck first.
+
+## Phase 11: Automation
+
+Monthly pipeline triggered by new Common Crawl release.
+
+- Shell script wrapping terraform + pipeline stages + deploy
+- Detect new crawl: cron job checking `https://index.commoncrawl.org/collinfo.json` weekly
+- Compare latest crawl ID against last processed (stored in a file or S3 tag)
+- On new crawl: terraform up → run pipeline → deploy → terraform down
+- Notifications: email/webhook on success or failure
+- Later: move to Forgejo CI with manual trigger button + scheduled trigger
+
+---
+
+## Future Improvements (Non-Blocking)
 
 ### Pipeline
-- **WARC parser: retry on fetch errors** — Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
-- **WARC parser: batch DB inserts** — Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
-- **WARC parser: investigate throughput ceiling** — 300 hosts/sec at both 100 and 500 concurrency suggests a bottleneck. Profile to determine if it's S3 response latency, Postgres writes, or something else. For the full 30M run this determines wall-clock time (~28 hours at current rate).
-- **CC-Index query: c5.2xlarge for full run** — 8GB is tight with 6.4GB usage + swap. 16GB instance for the 30M-host full run.
-- **Encoding: investigate remaining garbled titles** — Some titles still show `�` in output (e.g., `BERGSTRANDS BAGERI �...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
-- **Icon download: retry transient failures** — DNS and timeout failures could benefit from a single retry. Would recover a small percentage of icons.
-- **Icon download: download large link_rel icons** — Currently skipping declared sizes >64x64. Re-run with broader filter for future high-res projects.
-- **Bundle gen: SVG rasterization** — ~1,077 hosts have SVG-only favicons. Could add `rsvg-convert` or a Go SVG library to rasterize these.
-- **Bundle gen: smarter downscaling** — Currently nearest-neighbor to 32x32 for >128px icons. Could use bilinear/Lanczos for better quality, or preserve aspect ratio for non-square icons.
+- **WARC parser: retry on fetch errors** — add 1 retry with backoff for transient S3 errors
+- **WARC parser: batch DB inserts** — pgx batch or CopyFrom for better write throughput
+- **Encoding: remaining garbled titles** — more aggressive charset detection heuristics
+- **Icon download: retry transient failures** — single retry for DNS/timeout errors
+- **Bundle gen: SVG rasterization** — recover ~1,077 hosts with SVG-only favicons
+- **Bundle gen: bilinear downscaling** — better quality than nearest-neighbor for >128px icons
 
 ### Frontend
-- **Performance: reduce DOM / animation cost** — Pause marquee animation on off-screen rows (IntersectionObserver). Virtualize rows to reduce total DOM element count.
-- **Cross-browser tab styling** — Polish Chrome/Firefox/Safari tab appearances to more closely match real browser tabs. Test on actual browsers, use screenshots as reference.
-- **Mobile layout** — Current design assumes desktop viewport. Need responsive tab sizing and touch-friendly interaction.
-- **Build script** — `pipeline/06_frontend/build.sh` to inject TOTAL_BUNDLES and deploy to S3 + CloudFront invalidation.
-- **Stats page** — Serve `stats.json` and render pipeline stats (host count, icon coverage, crawl date) on the site.
+- **Cross-browser tab styling** — match real browser tabs more closely
+- **Mobile layout** — responsive tab sizing, touch-friendly interaction
+- **Stats page** — pipeline stats rendered on the site
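
Note on Phase 11 (not part of the patch): the "detect new crawl" and "compare latest crawl ID against last processed" steps could be sketched roughly as below. This is a minimal illustration, not the project's implementation. It assumes `collinfo.json` is a JSON array of objects with an `id` field such as `CC-MAIN-2024-18`, and that those IDs sort lexicographically by date. The function names and the state-file approach are hypothetical, and Python is used only for brevity.

```python
import json
import urllib.request
from pathlib import Path

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def fetch_collections(url=COLLINFO_URL, timeout=30):
    """Download and decode the collection list (network call)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def latest_crawl_id(collections):
    """Return the newest crawl ID.

    Assumption: every entry carries an "id" like "CC-MAIN-YYYY-WW", so the
    lexicographic maximum is the most recent crawl.
    """
    ids = [c["id"] for c in collections if c.get("id", "").startswith("CC-MAIN-")]
    if not ids:
        raise ValueError("no CC-MAIN crawl IDs found in collection list")
    return max(ids)


def should_run(latest, state_file):
    """True when `latest` differs from the ID recorded in `state_file`."""
    path = Path(state_file)
    last = path.read_text().strip() if path.exists() else ""
    return latest != last


def record_processed(crawl_id, state_file):
    """Persist the crawl ID after a successful pipeline run."""
    Path(state_file).write_text(crawl_id + "\n")


if __name__ == "__main__":
    state = "last_crawl.txt"  # hypothetical state file location
    newest = latest_crawl_id(fetch_collections())
    if should_run(newest, state):
        print(f"new crawl detected: {newest}")
        # terraform up -> run pipeline -> deploy -> terraform down would go here;
        # call record_processed(newest, state) only after the deploy succeeds.
    else:
        print("no new crawl")
```

Recording the ID only after a successful deploy means a failed run is retried on the next weekly check rather than silently skipped.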