updated PLAN.md for future plans
parent
85b663a6e8
commit
2f4e5b585d
1 changed file with 71 additions and 15 deletions
86
PLAN.md
|
|
@@ -561,22 +561,78 @@ On completion, each program prints a summary line and writes its stats JSON (wit
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Future Improvements
|
## Phase 7: 3M Scale Test
|
||||||
|
|
||||||
|
Validates disk-based icon storage at scale and gets real timing estimates.
|
||||||
|
|
||||||
|
- Tear down current infra, bring up fresh (300GB EBS)
|
||||||
|
- Run full pipeline with `--limit 3000000`
|
||||||
|
- Icon downloader writes to local disk (`--icons-dir icons`) instead of S3
|
||||||
|
- Bundle gen reads from local disk (`--icons-dir icons`)
|
||||||
|
- **Expected:** ~8hrs total, ~30GB disk for icons
|
||||||
|
- **Watch for:** DuckDB LIMIT behavior, disk I/O during icon download, any OOM issues
|
||||||
|
|
||||||
|
## Phase 8: 30M Full Run (Single Machine)
|
||||||
|
|
||||||
|
Full scan of all ~30M hosts on a single c5.xlarge.
|
||||||
|
|
||||||
|
- `--limit 0` for CC-Index query
|
||||||
|
- **Expected:** ~3-4 days total (WARC parsing ~25hrs, icon download ~50hrs)
|
||||||
|
- Run in tmux, monitor with `psql` queries from another session
|
||||||
|
- **Expected disk:** ~200-300GB for icons
|
||||||
|
- **Cost:** ~$44 (EC2 + RDS + EBS for 4 days)
|
||||||
|
- After completion: deploy frontend, verify live site, backup icons to homelab via rsync
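
As a rough sanity check on the Phase 8 estimate, the two stage durations quoted in this plan can be summed directly. This is only a sketch: the dictionary keys are labels chosen here for illustration, and the CC-Index query and bundle generation stages are left out because the plan gives no figures for them.

```python
# Stage durations in hours, taken from the Phase 8 estimate in this plan.
# The keys are illustrative labels, not identifiers used by the pipeline.
STAGE_HOURS = {"warc_parsing": 25, "icon_download": 50}


def total_days(stage_hours):
    """Sum sequential stage durations (hours) and convert the total to days."""
    return sum(stage_hours.values()) / 24


if __name__ == "__main__":
    # 25 h + 50 h = 75 h, and 75 / 24 = 3.125 days
    print(total_days(STAGE_HOURS))
```

The two quoted stages alone account for a little over three days, so the remaining stages have to fit in the rest of the 3-4 day window.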
|
||||||
|
|
||||||
|
## Phase 9: Frontend Polish
|
||||||
|
|
||||||
|
Before public launch:
|
||||||
|
- Cross-browser tab styling — use actual browser screenshots as reference for Chrome/Firefox/Safari
|
||||||
|
- Mobile responsive layout
|
||||||
|
- Performance — IntersectionObserver to pause off-screen marquee rows, reduce DOM count
|
||||||
|
- Stats page — render pipeline stats (host count, icon coverage, crawl date)
|
||||||
|
- Test across browsers and devices
|
||||||
|
|
||||||
|
## Phase 10: Parallelization (if needed)
|
||||||
|
|
||||||
|
Only pursue if single-machine time (~3-4 days) is unacceptable.
|
||||||
|
|
||||||
|
**Approach:** N machines for WARC parsing + icon download, consolidate, bundle on one machine.
|
||||||
|
- WARC parsing: already supports fleet via `FOR UPDATE SKIP LOCKED` — run same binary on N machines pointing at same RDS
|
||||||
|
- Icon download: same `SKIP LOCKED` pattern — each machine downloads to local disk
|
||||||
|
- Consolidation: rsync icons from N machines to one machine
|
||||||
|
- Bundle gen: runs on the consolidation machine
|
||||||
|
- **Alternative:** partition by `id % N` so each machine owns a shard end-to-end (WARC → icons → bundles), no consolidation needed. Bundle numbering uses non-overlapping ranges.
|
||||||
|
- **Infrastructure:** Terraform variable `ec2_count = N`, each instance gets same IAM/security group
|
||||||
|
- **Expected speedup:** ~linear (4 machines ≈ 4x faster ≈ 1 day for full scan)
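
The `id % N` alternative above can be sketched as follows. The function names and the `bundles_per_shard` parameter are hypothetical, introduced only for illustration: it is assumed here that each shard reserves a fixed-size block of bundle numbers, which is one simple way to get non-overlapping ranges.

```python
def shard_for(host_id, num_machines):
    """Return the index of the machine that owns this host end-to-end."""
    return host_id % num_machines


def bundle_range(shard, bundles_per_shard):
    """Return the block of bundle numbers reserved for one shard.

    Assumption: every shard gets the same fixed-size block, so blocks
    from different shards can never overlap.
    """
    start = shard * bundles_per_shard
    return range(start, start + bundles_per_shard)


if __name__ == "__main__":
    machines = 4
    for shard in range(machines):
        block = bundle_range(shard, 1000)
        print(f"shard {shard}: bundles {block.start}..{block.stop - 1}")
```

A fixed block size wastes bundle numbers if shards are uneven, so the frontend would need to tolerate gaps in the numbering; that trade-off is not addressed in the plan.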
|
||||||
|
|
||||||
|
**Simpler alternative to parallelization:**
|
||||||
|
- Download only `/favicon.ico` (skip link_rel icons) — cuts icon count from ~67M to ~30M, roughly halving the longest stage. Minimal quality loss since most best-selected icons are favicon.ico anyway.
|
||||||
|
- Use `c5n.xlarge` (up to 25 Gbps NIC) instead of `c5.xlarge` (up to 10 Gbps), but first check with `iftop` whether the network is actually the bottleneck.
|
||||||
|
|
||||||
|
## Phase 11: Automation
|
||||||
|
|
||||||
|
Monthly pipeline triggered by new Common Crawl release.
|
||||||
|
|
||||||
|
- Shell script wrapping terraform + pipeline stages + deploy
|
||||||
|
- Detect new crawl: cron job checking `https://index.commoncrawl.org/collinfo.json` weekly
|
||||||
|
- Compare latest crawl ID against last processed (stored in a file or S3 tag)
|
||||||
|
- On new crawl: terraform up → run pipeline → deploy → terraform down
|
||||||
|
- Notifications: email/webhook on success or failure
|
||||||
|
- Later: move to Forgejo CI with manual trigger button + scheduled trigger
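
The crawl-detection step could look roughly like the sketch below. It assumes that `collinfo.json` is a JSON array of objects, each with an `id` field such as `CC-MAIN-2024-10`; the function names are hypothetical, and fetching the file and persisting the last processed ID are deliberately left out.

```python
import json
import re

# Assumed ID shape for the regular crawls: CC-MAIN-<year>-<week>.
# IDs in any other shape are ignored.
CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def latest_crawl_id(collinfo_json):
    """Return the newest CC-MAIN-YYYY-WW id in the payload, or None."""
    best_key, best_id = None, None
    for entry in json.loads(collinfo_json):
        match = CRAWL_ID.match(entry.get("id", ""))
        if not match:
            continue
        key = (int(match.group(1)), int(match.group(2)))
        if best_key is None or key > best_key:
            best_key, best_id = key, entry["id"]
    return best_id


def is_new_crawl(latest, last_processed):
    """True when a crawl exists and differs from the one already processed."""
    return latest is not None and latest != last_processed


if __name__ == "__main__":
    sample = '[{"id": "CC-MAIN-2023-50"}, {"id": "CC-MAIN-2024-10"}]'
    latest = latest_crawl_id(sample)
    print(latest, is_new_crawl(latest, "CC-MAIN-2023-50"))
```

Picking the maximum by year and week, rather than trusting the order of the array, keeps the check independent of how the index happens to be sorted.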
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Future Improvements (Non-Blocking)
|
||||||
|
|
||||||
### Pipeline
|
### Pipeline
|
||||||
- **WARC parser: retry on fetch errors** — Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
|
- **WARC parser: retry on fetch errors** — add 1 retry with backoff for transient S3 errors
|
||||||
- **WARC parser: batch DB inserts** — Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
|
- **WARC parser: batch DB inserts** — pgx batch or CopyFrom for better write throughput
|
||||||
- **WARC parser: investigate throughput ceiling** — 300 hosts/sec at both 100 and 500 concurrency suggests a bottleneck. Profile to determine if it's S3 response latency, Postgres writes, or something else. For the full 30M run this determines wall-clock time (~28 hours at current rate).
|
- **Encoding: remaining garbled titles** — more aggressive charset detection heuristics
|
||||||
- **CC-Index query: c5.2xlarge for full run** — 8GB is tight with 6.4GB usage + swap. 16GB instance for the 30M-host full run.
|
- **Icon download: retry transient failures** — single retry for DNS/timeout errors
|
||||||
- **Encoding: investigate remaining garbled titles** — Some titles still show `<60>` in output (e.g., `BERGSTRANDS BAGERI <20>...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
|
- **Bundle gen: SVG rasterization** — recover ~1,077 hosts with SVG-only favicons
|
||||||
- **Icon download: retry transient failures** — DNS and timeout failures could benefit from a single retry. Would recover a small percentage of icons.
|
- **Bundle gen: bilinear downscaling** — better quality than nearest-neighbor for >128px icons
|
||||||
- **Icon download: download large link_rel icons** — Currently skipping declared sizes >64x64. Re-run with broader filter for future high-res projects.
|
|
||||||
- **Bundle gen: SVG rasterization** — ~1,077 hosts have SVG-only favicons. Could add `rsvg-convert` or a Go SVG library to rasterize these.
|
|
||||||
- **Bundle gen: smarter downscaling** — Currently nearest-neighbor to 32x32 for >128px icons. Could use bilinear/Lanczos for better quality, or preserve aspect ratio for non-square icons.
|
|
||||||
|
|
||||||
### Frontend
|
### Frontend
|
||||||
- **Performance: reduce DOM / animation cost** — Pause marquee animation on off-screen rows (IntersectionObserver). Virtualize rows to reduce total DOM element count.
|
- **Cross-browser tab styling** — match real browser tabs more closely
|
||||||
- **Cross-browser tab styling** — Polish Chrome/Firefox/Safari tab appearances to more closely match real browser tabs. Test on actual browsers, use screenshots as reference.
|
- **Mobile layout** — responsive tab sizing, touch-friendly interaction
|
||||||
- **Mobile layout** — Current design assumes desktop viewport. Need responsive tab sizing and touch-friendly interaction.
|
- **Stats page** — pipeline stats rendered on the site
|
||||||
- **Build script** — `pipeline/06_frontend/build.sh` to inject TOTAL_BUNDLES and deploy to S3 + CloudFront invalidation.
|
|
||||||
- **Stats page** — Serve `stats.json` and render pipeline stats (host count, icon coverage, crawl date) on the site.
|
|
||||||
|
|
|
||||||