updated PLAN.md for future plans

Joe Lothan 2026-05-19 08:34:42 -04:00
parent 85b663a6e8
commit 2f4e5b585d

86
PLAN.md

@@ -561,22 +561,78 @@ On completion, each program prints a summary line and writes its stats JSON (wit
---
## Future Improvements
## Phase 7: 3M Scale Test
Validates disk-based icon storage at scale and gets real timing estimates.
- Tear down current infra, bring up fresh (300GB EBS)
- Run full pipeline with `--limit 3000000`
- Icon downloader writes to local disk (`--icons-dir icons`) instead of S3
- Bundle gen reads from local disk (`--icons-dir icons`)
- **Expected:** ~8hrs total, ~30GB disk for icons
- **Watch for:** DuckDB LIMIT behavior, disk I/O during icon download, any OOM issues
## Phase 8: 30M Full Run (Single Machine)
Full internet scan on one c5.xlarge.
- `--limit 0` for CC-Index query
- **Expected:** ~3-4 days total (WARC parsing ~25hrs, icon download ~50hrs)
- Run in tmux, monitor with `psql` queries from another session
- **Expected disk:** ~200-300GB for icons
- **Cost:** ~$44 (EC2 + RDS + EBS for 4 days)
- After completion: deploy frontend, verify live site, backup icons to homelab via rsync
## Phase 9: Frontend Polish
Before public launch:
- Cross-browser tab styling — use actual browser screenshots as reference for Chrome/Firefox/Safari
- Mobile responsive layout
- Performance — IntersectionObserver to pause off-screen marquee rows, reduce DOM count
- Stats page — render pipeline stats (host count, icon coverage, crawl date)
- Test across browsers and devices
## Phase 10: Parallelization (if needed)
Only pursue if single-machine time (~3-4 days) is unacceptable.
**Approach:** N machines for WARC parsing + icon download, consolidate, bundle on one machine.
- WARC parsing: already supports fleet via `FOR UPDATE SKIP LOCKED` — run same binary on N machines pointing at same RDS
- Icon download: same `SKIP LOCKED` pattern — each machine downloads to local disk
- Consolidation: rsync icons from N machines to one machine
- Bundle gen: runs on the consolidation machine
- **Alternative:** partition by `id % N` so each machine owns a shard end-to-end (WARC → icons → bundles), no consolidation needed. Bundle numbering uses non-overlapping ranges.
- **Infrastructure:** Terraform variable `ec2_count = N`, each instance gets same IAM/security group
- **Expected speedup:** ~linear (4 machines ≈ 4x faster ≈ 1 day for full scan)
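A minimal sketch of the `id % N` alternative, assuming hosts have an integer `id` and that each machine can be handed a fixed-size block of bundle numbers up front. The `Shard` type, `make_shards`, and the `bundles_per_shard` parameter are hypothetical names for illustration, not part of the pipeline; `bundles_per_shard` would need to be an upper bound on how many bundles one shard can produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    index: int         # which machine this is, 0..count-1
    count: int         # total number of machines (N)
    bundle_start: int  # first bundle number this shard may write
    bundle_end: int    # one past the last bundle number (exclusive)

    def owns(self, host_id: int) -> bool:
        """A host belongs to exactly one shard: the one where id % N == index."""
        return host_id % self.count == self.index


def make_shards(n: int, bundles_per_shard: int) -> list[Shard]:
    """Give each of n machines a disjoint, contiguous block of bundle numbers."""
    if n < 1 or bundles_per_shard < 1:
        raise ValueError("n and bundles_per_shard must be positive")
    return [
        Shard(i, n, i * bundles_per_shard, (i + 1) * bundles_per_shard)
        for i in range(n)
    ]


shards = make_shards(n=4, bundles_per_shard=10_000)
for host_id in (0, 1, 2, 3, 4, 1_000_003):
    owner = next(s for s in shards if s.owns(host_id))
    print(f"host {host_id} -> shard {owner.index}, "
          f"bundles {owner.bundle_start}..{owner.bundle_end - 1}")
```

Because ownership depends only on the host `id`, no machine needs to coordinate with another after startup, which is what removes the consolidation step. The trade-off is that bundle numbers will have gaps wherever a shard uses less than its reserved range.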
**Simpler alternative to parallelization:**
- Download only `/favicon.ico` (skip link_rel icons) — cuts icon count from ~67M to ~30M, roughly halving the longest stage. Minimal quality loss since most best-selected icons are favicon.ico anyway.
- Use `c5n.xlarge` (up to 25 Gbps NIC) instead of `c5.xlarge` (up to 10 Gbps) — check with `iftop` first whether the network is actually the bottleneck.
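The timing claims above can be sanity-checked with some back-of-envelope arithmetic. The sketch below uses only figures stated in this plan (30M hosts, the ~300 hosts/sec WARC parser rate, ~67M icons in ~50hrs, and the ~25hr WARC estimate) and assumes each stage runs at a steady rate that does not depend on which icons are fetched. That assumption is mine, not a measured result.

```python
def hours(items: float, rate_per_sec: float) -> float:
    """Wall-clock hours to process `items` at a steady `rate_per_sec`."""
    return items / rate_per_sec / 3600


HOSTS = 30_000_000       # full-run host count from the plan
WARC_RATE = 300          # hosts/sec, the observed WARC parser rate
ALL_ICONS = 67_000_000   # favicon.ico plus link_rel icons
ICON_STAGE_HOURS = 50    # the plan's icon-download estimate
WARC_STAGE_HOURS = 25    # the plan's WARC-parsing estimate

warc_hours = hours(HOSTS, WARC_RATE)
icon_rate = ALL_ICONS / (ICON_STAGE_HOURS * 3600)  # implied icons/sec
favicon_only_hours = hours(HOSTS, icon_rate)       # ~30M icons at that rate
fleet_hours = (WARC_STAGE_HOURS + ICON_STAGE_HOURS) / 4  # ideal linear 4x

print(f"WARC parsing at 300 hosts/sec: {warc_hours:.1f} h")
print(f"Implied icon download rate:    {icon_rate:.0f} icons/sec")
print(f"favicon.ico only:              {favicon_only_hours:.1f} h")
print(f"4 machines, linear speedup:    {fleet_hours:.2f} h")
```

Under these assumptions the favicon-only option brings the icon stage from ~50 hours to roughly 22, consistent with "roughly halving", and an ideal 4x fleet lands under a day for the two long stages.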
## Phase 11: Automation
Monthly pipeline triggered by new Common Crawl release.
- Shell script wrapping terraform + pipeline stages + deploy
- Detect new crawl: cron job checking `https://index.commoncrawl.org/collinfo.json` weekly
- Compare latest crawl ID against last processed (stored in a file or S3 tag)
- On new crawl: terraform up → run pipeline → deploy → terraform down
- Notifications: email/webhook on success or failure
- Later: move to Forgejo CI with manual trigger button + scheduled trigger
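The detect-and-compare step could look roughly like the sketch below. It assumes `collinfo.json` is a JSON array of objects that each carry an `id` such as `CC-MAIN-2024-18`, and that the last processed ID is kept in a plain text file. The function names and the state-file path are hypothetical. The sample payload is illustrative and omits the other fields in the real file.

```python
import json
import re
import urllib.request
from pathlib import Path

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"
# Modern crawl IDs look like CC-MAIN-YYYY-WW; older formats are ignored.
CRAWL_ID = re.compile(r"^CC-MAIN-\d{4}-\d{2}$")


def fetch_collinfo(url: str = COLLINFO_URL) -> list[dict]:
    """Download and parse the crawl list (not called in this demo)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def latest_crawl_id(collinfo: list[dict]) -> str:
    """Pick the newest ID; fixed-width IDs sort chronologically as strings."""
    ids = [c["id"] for c in collinfo if CRAWL_ID.match(c.get("id", ""))]
    if not ids:
        raise ValueError("no CC-MAIN-YYYY-WW crawl IDs found")
    return max(ids)


def read_last_processed(state_file: Path) -> str:
    """Return the stored crawl ID, or "" if nothing has been processed yet."""
    return state_file.read_text().strip() if state_file.exists() else ""


def needs_run(latest: str, last_processed: str) -> bool:
    return latest != last_processed


# Illustrative payload shaped like the real file (other fields omitted).
sample = json.loads(
    '[{"id": "CC-MAIN-2024-18"}, {"id": "CC-MAIN-2024-10"},'
    ' {"id": "CC-MAIN-2009-2010"}]'
)
latest = latest_crawl_id(sample)
print(latest, needs_run(latest, "CC-MAIN-2024-10"))
```

A weekly cron job would call `fetch_collinfo()` in place of the sample payload, and write `latest` back to the state file only after the pipeline and deploy succeed, so a failed run is retried on the next check.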
---
## Future Improvements (Non-Blocking)
### Pipeline
- **WARC parser: retry on fetch errors** — Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
- **WARC parser: batch DB inserts** — Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
- **WARC parser: investigate throughput ceiling** — 300 hosts/sec at both 100 and 500 concurrency suggests a bottleneck. Profile to determine if it's S3 response latency, Postgres writes, or something else. For the full 30M run this determines wall-clock time (~28 hours at current rate).
- **CC-Index query: c5.2xlarge for full run** — 8GB is tight with 6.4GB usage + swap. Use a 16GB instance for the 30M-host full run.
- **Encoding: investigate remaining garbled titles** — Some titles still show `<60>` in output (e.g., `BERGSTRANDS BAGERI <20>...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
- **Icon download: retry transient failures** — DNS and timeout failures could benefit from a single retry. Would recover a small percentage of icons.
- **Icon download: download large link_rel icons** — Currently skipping declared sizes >64x64. Re-run with broader filter for future high-res projects.
- **Bundle gen: SVG rasterization** — ~1,077 hosts have SVG-only favicons. Could add `rsvg-convert` or a Go SVG library to rasterize these.
- **Bundle gen: smarter downscaling** — Currently nearest-neighbor to 32x32 for >128px icons. Could use bilinear/Lanczos for better quality, or preserve aspect ratio for non-square icons.
- **WARC parser: retry on fetch errors** — add 1 retry with backoff for transient S3 errors
- **WARC parser: batch DB inserts** — pgx batch or CopyFrom for better write throughput
- **Encoding: remaining garbled titles** — more aggressive charset detection heuristics
- **Icon download: retry transient failures** — single retry for DNS/timeout errors
- **Bundle gen: SVG rasterization** — recover ~1,077 hosts with SVG-only favicons
- **Bundle gen: bilinear downscaling** — better quality than nearest-neighbor for >128px icons
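The two retry items (WARC fetches and icon downloads) share one shape: a single extra attempt after a short wait, for errors believed to be transient. A language-neutral sketch follows, assuming the caller can classify DNS, timeout, and S3 errors into one exception type. `TransientError` and `with_retry` are hypothetical names, and the 0.5 s base delay is an arbitrary placeholder rather than a tuned value.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (DNS, timeout, S3 5xx)."""


def with_retry(
    op: Callable[[], T],
    retries: int = 1,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op(); on TransientError wait base_delay * 2**attempt, then retry.

    Makes at most retries + 1 attempts and re-raises the last error.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    for attempt in range(retries + 1):
        try:
            return op()
        except TransientError:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Demo: the first call fails, the single retry succeeds.
calls = {"n": 0}


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientError("simulated timeout")
    return "icon-bytes"


waits: list[float] = []
print(with_retry(flaky_fetch, sleep=waits.append), waits)
```

Passing `sleep` as a parameter keeps the demo instant and makes the wait schedule easy to assert on. Non-transient errors (for example a 404) pass straight through without a retry.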
### Frontend
- **Performance: reduce DOM / animation cost** — Pause marquee animation on off-screen rows (IntersectionObserver). Virtualize rows to reduce total DOM element count.
- **Cross-browser tab styling** — Polish Chrome/Firefox/Safari tab appearances to more closely match real browser tabs. Test on actual browsers, use screenshots as reference.
- **Mobile layout** — Current design assumes desktop viewport. Need responsive tab sizing and touch-friendly interaction.
- **Build script** — `pipeline/06_frontend/build.sh` to inject TOTAL_BUNDLES and deploy to S3 + CloudFront invalidation.
- **Stats page** — Serve `stats.json` and render pipeline stats (host count, icon coverage, crawl date) on the site.
- **Cross-browser tab styling** — match real browser tabs more closely
- **Mobile layout** — responsive tab sizing, touch-friendly interaction
- **Stats page** — pipeline stats rendered on the site