From c7e33defa20bee80d81628a95c48ec65b4f70178 Mon Sep 17 00:00:00 2001
From: Joe Lothan
Date: Mon, 18 May 2026 12:49:34 -0400
Subject: [PATCH] updated plan.md, finished integration test

---
 PLAN.md | 84 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 51 insertions(+), 33 deletions(-)

diff --git a/PLAN.md b/PLAN.md
index 42caaba..14c7374 100644
--- a/PLAN.md
+++ b/PLAN.md
@@ -246,45 +246,43 @@ Files: `frontend/index.html` and `frontend/site.js`
 
 ---
 
-## Phase 6: Integration & End-to-End Test (100K)
+## Phase 6: Integration & End-to-End Test (100K) [COMPLETED]
 
-### Step 6.1: Run Full Pipeline (100K)
+### Steps 6.1-6.3 [COMPLETED]
 
-Execute all stages in sequence on EC2:
-1. Verify hosts table has 100K entries (from Phase 1)
-2. Run WARC parser (Phase 2) — should complete in minutes
-3. Run icon downloader (Phase 3) — should complete in 10-30 minutes at 100K scale
-4. Run best icon selection (Phase 4.1)
-5. Run bundle generator (Phase 4.2-4.4)
-6. Run frontend build (Phase 5.6)
+Full clean end-to-end run from `terraform apply` to live site at everytab.site.
 
-**Validation:** Visit the CloudFront URL. The site should work:
-- Tabs render with real favicons and titles
-- Clicking works (iframe + external)
-- Scrolling loads more tabs
-- No JS console errors
+**Pipeline timing (100K hosts):**
+| Stage | Duration |
+|-------|----------|
+| CC-Index query | 13m11s |
+| WARC parsing | 4m55s |
+| Icon download | 10m39s |
+| Best icon selection | instant |
+| Bundle generation | 1m32s |
+| Frontend deploy | seconds |
+| **Total pipeline** | **~31 minutes** |
 
-### Step 6.2: Tune Parameters
+**Loss funnel:**
+```
+100,000 hosts
+ → 93,432 with titles (6.6% loss)
+ → 70,551 with icons selected (24.4% loss)
+ → 69,306 with icons in bundles (1.8% convert errors)
+ → 779 bundles, 165MB total, avg 217KB per bundle
+```
 
-Based on the 100K run:
-- **ENTRIES_PER_BUNDLE:** Look at the live site. Does one bundle fill the screen? Too many tabs? Too few? Adjust.
-- **Concurrency:** Was the icon download memory-stable? CPU-bound or network-bound? Adjust goroutine pool size.
-- **Timeouts:** What was the error distribution? Are timeouts too aggressive? Too lenient?
-- **Icon selection:** Do the selected icons look good? Any weird sizes or broken images?
+**Parameter review:**
+- `ENTRIES_PER_BUNDLE = 120` — fills the screen well, kept as-is
+- Icon download concurrency 200 — I/O bound at 350 icons/sec, increasing doesn't help
+- Timeouts 10s — good balance, 2,261 timeouts (1%) is acceptable
+- Icons look good on the live site
 
-Update CLI flag defaults based on findings.
-
-### Step 6.3: Collect & Review Stats
-
-Merge all `stats/*.json` into a single pipeline report. Review:
-- Loss at each stage (domains → parsed → icons downloaded → icons selected → bundled)
-- Time per stage
-- Error patterns (are certain TLDs failing more? certain icon formats?)
-- Storage usage (S3 icons bucket, S3 site bucket)
-
-Identify any pipeline bugs or data quality issues. Fix before scaling up.
-
-**Done when:** End-to-end works at 100K, parameters tuned, stats reviewed, bugs fixed.
+**Infrastructure notes:**
+- c5.xlarge (8GB) needs 4GB swap for the DuckDB query — gets OOM killed without it
+- DuckDB occasionally gets S3 503 (rate limit) — retry works
+- `force_destroy = true` on icons bucket needed for clean teardown
+- deploy.sh sed pattern must match `[0-9]*` not `.*` to avoid eating ``
 
 ---
 
@@ -545,6 +543,26 @@ On completion, each program prints a summary line and writes its stats JSON (wit
 
 ## Future Improvements
 
+### Phase 6 — Completed 2026-05-18
+
+**Changes from original plan:**
+- Added 4GB swap file to EC2 bootstrap — DuckDB OOM kills without it on c5.xlarge.
+- Added `force_destroy = true` to icons S3 bucket — terraform teardown fails otherwise when bucket has objects.
+- Pipeline README with full sanity checks between each stage.
+- Deploy script (`pipeline/06_frontend/deploy.sh`) automates frontend upload + CloudFront invalidation.
+- CloudFront + ACM certificate + S3 bucket policy added to Terraform. Domain setup (Gandi ALIAS record) is one-time manual step.
+
+**Lessons learned:**
+- Pipeline is reproducible — second clean run produced nearly identical numbers to the first.
+- DuckDB gets S3 503 errors occasionally (rate limiting). Retry works. May need `SET threads = 4` or retry logic for the full 30M run.
+- ACM certificate validation is a chicken-and-egg with CloudFront — use `aws_acm_certificate_validation` resource to make Terraform wait for DNS validation before creating the distribution.
+- deploy.sh sed must match `[0-9]*` not `.*` — the greedy match eats the closing `` tag.
+- Total wall-clock from `terraform apply` to live site: ~45 minutes (including bootstrap).
+
+---
+
+## Future Improvements
+
 ### Pipeline
 - **WARC parser: retry on fetch errors** — Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
 - **WARC parser: batch DB inserts** — Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
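
The "1 retry with backoff" idea for transient S3 errors (and the note that DuckDB's occasional S3 503s clear on retry) could look roughly like the sketch below. It is a Python illustration only, not the pipeline's actual code: `TransientError`, `fetch_with_retry`, the injected `sleep` parameter, and the 0.5 s base delay are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. an S3 503)."""


def fetch_with_retry(
    fetch: Callable[[], T],
    retries: int = 1,            # assumption: a single retry, as the plan suggests
    base_delay: float = 0.5,     # assumption: illustrative starting delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on TransientError, wait and retry up to `retries` times.

    The wait doubles on each attempt (base_delay, 2*base_delay, ...).
    Any other exception, or a TransientError after the last retry, propagates.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except TransientError:
            if attempt >= retries:
                raise
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Only errors explicitly classified as transient are retried, so permanent failures still surface immediately and are counted in the stage's loss numbers.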