updated PLAN.md with another 3M run to test code changes

2026-05-19 13:42:19 -04:00 · 2026-05-19 13:42:19 -04:00 · 41c0eb5c49
commit 41c0eb5c49
parent a28cd2b056
1 changed files with 111 additions and 112 deletions
--- a/PLAN.md
+++ b/PLAN.md
@@ -286,116 +286,11 @@ Full clean end-to-end run from `terraform apply` to live site at everytab.site.

 ---

-## Phase 7: Full-Scale Run (30M)
-
-### Step 7.1: Remove Limits, Re-run CC-Index Query
-
-Update the DuckDB query to remove `LIMIT 100000`. Re-run.
-
-Considerations:
- If httpfs takes >1hr, switch to downloading the parquet files first
- May need to increase RDS storage (30M rows with WARC paths ≈ 5-10GB)
- Monitor DuckDB memory usage
-
-**Validation:** `SELECT COUNT(*) FROM hosts;` shows ~30M rows.
-
-### Step 7.2: Run WARC Parser at Scale
-
-Run with full concurrency against 30M hosts. Expected time: 2-6 hours.
-
-Monitor:
- Throughput (hosts/sec)
- Error rate stability (should plateau, not climb)
- Postgres connection pool health
- Memory usage
-
-### Step 7.3: Run Icon Downloader at Scale
-
-This is the long pole — expected 12-48 hours.
-
-Monitor continuously:
- icons/sec rate
- DNS cache hit rate (check Unbound stats: `unbound-control stats`)
- S3 upload rate
- Error rate by type
- Completion percentage
-
-If too slow (projected >48hrs):
- Consider increasing concurrency (if memory allows)
- Consider spinning up fleet (add more EC2 instances running the same binary)
- Check if DNS is the bottleneck (Unbound stats)
- Check if S3 uploads are the bottleneck (batch or reduce HEAD checks)
-
-### Step 7.4: Best Icon Selection + Bundle Generation
-
-Run at full scale. Expected: 1-2 hours total.
-
-Monitor bundle sizes — verify they're in the expected range with `ENTRIES_PER_BUNDLE` from tuning.
-
-### Step 7.5: Rebuild Frontend + Deploy
-
-Run frontend build with the real bundle count. Invalidate CloudFront.
-
-**Validation:** Visit the live site. Browse around. Check:
- Tab variety (seeing diverse sites, not just one TLD)
- Icon quality (no broken images, reasonable sizes)
- Performance (bundles load quickly, no jank)
- Stats page / stats.json looks correct
-
-**Done when:** Full-scale site is live and working.
-
---
-
-## Phase 8: Backup & Teardown
-
-### Step 8.1: Backup RDS to Homelab
-
-```bash
-# On EC2 (fast connection to RDS):
-pg_dump -Fc $DATABASE_URL > everytab_dump.pgfc
-
-# Transfer to homelab (from EC2 or direct):
-scp everytab_dump.pgfc homelab:/backups/everytab/
-
-# On homelab, verify restore:
-pg_restore -d everytab_local everytab_dump.pgfc
-psql everytab_local -c "SELECT COUNT(*) FROM hosts; SELECT COUNT(*) FROM icons;"
-```
-
-### Step 8.2: Backup Icons S3 to Homelab
-
-```bash
-# From homelab (or EC2 as intermediary):
-aws s3 sync s3://everytab-icons/ /backups/everytab/icons/
-
-# Verify file count matches:
-ls /backups/everytab/icons/ | wc -l
-# Compare with: aws s3 ls s3://everytab-icons/ | wc -l
-```
-
-### Step 8.3: Verify & Teardown
-
-After confirming backups:
-
-```bash
-# Verify the live site still works (it only depends on everytab-site + CloudFront)
-curl -s https://your-cloudfront-domain.net/ | head
-
-# Teardown scanning infrastructure:
-aws rds delete-db-instance --db-instance-identifier everytab --skip-final-snapshot
-aws s3 rb s3://everytab-icons --force
-aws ec2 terminate-instances --instance-ids i-xxxxx
-```
-
-**Done when:** Only `everytab-site` S3 bucket + CloudFront remain running. Monthly cost: ~$2-4.
-
---
-
 ## Development Notes

 ### Execution Order

-Phases are sequential: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8. Frontend (Phase 5) uses real data from the 100K pipeline run. The only thing that can be developed ahead of time is writing Go code locally before EC2 is ready (compile-test locally, run on EC2).
+Phases are sequential: 0 → 1 → 2 → 3 → 4 → 5 → 6. Frontend (Phase 5) uses real data from the 100K pipeline run. The only thing that can be developed ahead of time is writing Go code locally before EC2 is ready (compile-test locally, run on EC2).

 ### Progress & Observability

@@ -559,18 +454,119 @@ On completion, each program prints a summary line and writes its stats JSON (wit
 - deploy.sh sed must match `[0-9]*` not `.*` — the greedy match eats the closing `</script>` tag.
 - Total wall-clock from `terraform apply` to live site: ~45 minutes (including bootstrap).

+### Phase 7 (in progress) — 2026-05-19
+
+**Changes from original plan:**
+- Switched icon storage from S3 to local disk with sharded directories (`ab/cd/ef/hash`). Eliminates ~$175 in S3 PUT costs at 30M scale.
+- Downloading ALL icons (removed size filter) — full archive for posterity, filter at bundle generation time.
+- EBS bumped from 300GB to 1TB to hold full icon archive.
+- Dropped `ORDER BY md5(id::text)` from icon claim query — was causing multi-second query times at 3M+ rows, creating 30-second burst/stall cycles. Without ORDER BY, query is instant and workers stay saturated.
+- Batch size 200 → 5000, channel buffer = batch size. Fewer DB round-trips, workers always have work.
+- Pin EC2 AMI in tfvars to prevent Terraform from replacing the instance when Amazon publishes a new AMI.
+- Added CloudFront access logging to S3 (`everytab-logs` bucket).
+
+**Lessons learned:**
+- `ORDER BY md5(id::text)` is O(n) on the unscanned set — fine at 100K, catastrophic at 3M+. The md5 shuffle for "good crawler" behavior is unnecessary when downloading ~2 icons per host across 30M hosts.
+- Icon download throughput improved from 350/s to 408/s just by fixing the claim query bottleneck. The burst/stall pattern on iftop was the key diagnostic.
+- S3 503 "SlowDown" errors are per-bucket-partition, not per-client. At concurrency 100 against commoncrawl bucket, hit 2% error rate. Concurrency 50 eliminated them.
+- Terraform `data.aws_ami` lookup fetches the latest AMI every apply. Pin it in tfvars to avoid unexpected instance replacement during non-EC2 changes.
+- EBS cost is negligible ($10 for 1TB × 4 days) compared to S3 PUT costs ($175 for 35M PUTs). At this scale, local disk is far cheaper for temporary working storage.
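The cost comparison in the last lesson can be reproduced with a few lines of arithmetic. The sketch below assumes list prices that are not stated in the plan itself: $0.005 per 1,000 S3 PUT requests and $0.08 per GB-month for gp3 EBS, with a 30-day month. The function names are hypothetical.

```go
package main

import "fmt"

// s3PutCost returns the request cost in USD for a number of PUTs,
// given a price per 1,000 requests.
func s3PutCost(puts, usdPer1000 float64) float64 {
	return puts / 1000 * usdPer1000
}

// ebsCost returns the storage cost in USD for a volume of sizeGB held
// for the given number of days, prorated over a 30-day month.
func ebsCost(sizeGB, days, usdPerGBMonth float64) float64 {
	return sizeGB * usdPerGBMonth * days / 30
}

func main() {
	// Assumed prices: $0.005 per 1,000 PUTs, $0.08 per GB-month (gp3).
	fmt.Printf("S3 PUTs (35M): $%.2f\n", s3PutCost(35e6, 0.005)) // $175.00
	fmt.Printf("EBS (1TB, 4d): $%.2f\n", ebsCost(1000, 4, 0.08)) // $10.67
}
```

Under those assumed prices the figures match the ~$175 and ~$10 quoted above.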
+
+**3M run results:**
+- CC-Index: 3M hosts, ~13min
+- WARC parsing: 3M hosts, ~3hrs at concurrency 50 (reduced from 100 due to S3 503s)
+- Icon download: 6.4M icons, 4h21m at 408/s, 70% success, 53GB downloaded
+- Best icon selection: instant
+- Bundle generation: 2.7M hosts, 1h23m (~540 hosts/sec), 22,429 bundles, 4.7GB total, 215KB avg
+- Frontend deployed with 22,429 bundles to everytab.site
+
+**Bundle gen issues identified:**
+- OOM at 3M scale — original design loaded all hosts + base64 icons into memory. Fixed: streaming pagination (6000 hosts per page, write bundles incrementally).
+- Fits-and-starts pattern: all 6000 conversions must complete before any bundles are written. One slow ICO decode blocks the entire page. S3 uploads are sequential.
+- CPU underutilized — PNG encoding is CPU-bound but the batch-convert-then-write pattern means cores idle during DB fetches and S3 uploads.
+- At current rate (540 hosts/sec), 30M hosts would take ~15hrs for bundle gen alone. Pipeline redesign needed.
+
+**Additional lessons learned:**
+- `random_order` column with `DEFAULT random()` on hosts table enables shuffled bundle generation without expensive ORDER BY. Indexed for fast pagination.
+- Bundle sizes are remarkably consistent across scales (215-217KB avg at 100K, 3M). The `entries-per-bundle = 120` parameter is well-tuned.
+- Convert errors at 3M: 4,124 (0.2%) — mostly SVGs and malformed ICOs. Acceptable loss.
+- 2.7M hosts bundled (not 3M) because ~300K had no title (excluded from bundles).
+
 ---

 ## Phase 7: 3M Scale Test

 Validates disk-based icon storage at scale and gets real timing estimates.

- Tear down current infra, bring up fresh (300GB EBS)
+- Tear down current infra, bring up fresh (1TB EBS)
 - Run full pipeline with `--limit 3000000`
 - Icon downloader writes to local disk (`--icons-dir icons`) instead of S3
+- Downloads ALL icons (no size filter) — full archive for posterity
 - Bundle gen reads from local disk (`--icons-dir icons`)
- **Expected:** ~8hrs total, ~30GB disk for icons
- **Watch for:** DuckDB LIMIT behavior, disk I/O during icon download, any OOM issues
+- **Watch for:** disk I/O, DuckDB LIMIT behavior, any OOM issues
+
+### Icon selection strategy (TODO: decide before Phase 8)
+
+Now that we download ALL icons (including large 192x192, 512x512, etc.), the best-icon selection for the live site needs thought. Current SQL picks by:
+1. Standard square sizes ≤64 → other squares ≤64 → non-square ≤64 → everything else
+2. Prefer PNG/GIF/ICO over WebP, exclude SVG
+3. Tiebreak by smaller file size
+
+This works but questions remain:
+- Should we prefer favicon.ico over link_rel when quality is similar? (favicon.ico is the universal fallback, link_rel might be higher quality but less reliable)
+- Should we downscale >128px to 32x32 in bundle gen, or let the browser handle it? (affects bundle size vs quality)
+- Should we have different strategies for different icon sources? (e.g., always use link_rel PNG 32x32 if available, fall back to favicon.ico)
+- At 30M scale, how much do large icons bloat total bundle size? Need data from the 3M run to decide.
+
+## Phase 7.2: Performance Fixes + 3M Re-test
+
+Code review and performance improvements before the full 30M run. Make changes, review, then re-run the full 3M pipeline to validate.
+
+### Bundle gen redesign: streaming pipeline
+
+Current architecture (batch-convert-then-write):
+```
+[fetch 6000 hosts] → [convert all 6000 icons] → [write bundles] → [fetch next 6000]
+                      ^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^
+                      all cores busy              cores idle
+```
+
+Target architecture (fully pipelined):
+```
+[DB fetcher] → channel → [N converter workers] → channel → [bundle writer]
+   always                   always busy              writes as soon as
+   prefetching                                       120 entries ready
+```
+
+Changes:
+- **DB fetcher goroutine** — continuously fetches pages, feeds host channel. Same pattern as icon downloader.
+- **Converter workers** (200 goroutines) — read from host channel, read icon from disk, decode, re-encode PNG, base64, send BundleEntry to output channel.
+- **Bundle writer goroutine** — collects entries from output channel into a buffer. Every 120 entries, serialize JSON and upload to S3. Runs concurrently with conversion.
+
+This removes the page-level barriers between stages, so DB, disk, CPU, and S3 are all utilized simultaneously. No fits and starts.
+
+### Other fixes to include
+- **WARC parser: add S3 retry with backoff** — currently concurrency 50 to avoid 503s. With retry, can go back to 100+ and handle transient 503s gracefully.
+- **Icon download: reduce timeout to 3-5s** — most legitimate servers respond in <1s. Dead hosts block a worker for 10s currently.
+- **Icon download: confirm all icons downloaded** — size filter already removed from claim query, but verify at 3M that all link_rel icons (including large declared sizes) are being downloaded.
+- **Icon selection strategy** — decide on final criteria (see Phase 7 notes above) and validate with 3M data.
+
+### Code review
+Before running the 3M re-test, do a full read-through of all Go code:
+- Check all error handling paths — are errors logged, counted, and surfaced?
+- Check concurrency patterns — any race conditions, deadlocks, goroutine leaks?
+- Check resource cleanup — are DB connections, file handles, HTTP responses closed?
+- Check the new streaming bundle gen — does it handle edge cases (empty pages, partial final bundle, S3 upload failures)?
+- Review all CLI flag defaults — are they tuned for the 30M run?
+
+### Validation
+- Run full 3M pipeline from scratch with all fixes
+- Compare timings against Phase 7 run
+- Confirm bundle gen saturates CPU and doesn't stall
+- Confirm icon download rate improves with shorter timeout
+- Confirm WARC parsing can run at higher concurrency with retry
+- Confirm all icon types (including large) are downloaded
+- Review live site at everytab.site with 3M data

 ## Phase 8: 30M Full Run (Single Machine)

@@ -579,9 +575,9 @@ Full internet scan on one c5.xlarge.
 - `--limit 0` for CC-Index query
 - **Expected:** ~3-4 days total (WARC parsing ~25hrs, icon download ~50hrs)
 - Run in tmux, monitor with `psql` queries from another session
- **Expected disk:** ~200-300GB for icons
- **Cost:** ~$44 (EC2 + RDS + EBS for 4 days)
- After completion: deploy frontend, verify live site, backup icons to homelab via rsync
+- **Expected disk:** ~500GB-1TB for all icons (full archive)
+- **Cost:** ~$50 (EC2 + RDS + 1TB EBS for 4 days)
+- After completion: deploy frontend, verify live site, backup icons + DB to homelab via rsync

 ## Phase 9: Frontend Polish

@@ -636,3 +632,6 @@ Monthly pipeline triggered by new Common Crawl release.
 - **Cross-browser tab styling** — match real browser tabs more closely
 - **Mobile layout** — responsive tab sizing, touch-friendly interaction
 - **Stats page** — pipeline stats rendered on the site
+
+### Icon Download Ordering
+- **Verify icon row ordering at 30M scale** — WARC parser inserts ~2-5 icons per host in sequence (favicon_ico + link_rel entries). Without ORDER BY, Postgres returns rows roughly in insertion order, so icons from the same host are adjacent. At 100K-3M this didn't cause problems (batch size 5000 means each batch has icons from ~2,000+ different hosts). At 30M, confirm with `iftop` that we're not hammering individual hosts. If needed, add `random_order REAL DEFAULT random()` column to icons table and use it in the claim query — but don't index it (60M+ writes).