updated plan.md, finished integration test

Joe Lothan 2026-05-18 12:49:34 -04:00
parent 5b3f6a6870
commit c7e33defa2

PLAN.md (84 lines changed)

@@ -246,45 +246,43 @@ Files: `frontend/index.html` and `frontend/site.js`
 ---
-## Phase 6: Integration & End-to-End Test (100K)
-### Step 6.1: Run Full Pipeline (100K)
-Execute all stages in sequence on EC2:
-1. Verify hosts table has 100K entries (from Phase 1)
-2. Run WARC parser (Phase 2): should complete in minutes
-3. Run icon downloader (Phase 3): should complete in 10-30 minutes at 100K scale
-4. Run best icon selection (Phase 4.1)
-5. Run bundle generator (Phase 4.2-4.4)
-6. Run frontend build (Phase 5.6)
-**Validation:** Visit the CloudFront URL. The site should work:
-- Tabs render with real favicons and titles
-- Clicking works (iframe + external)
-- Scrolling loads more tabs
-- No JS console errors
+## Phase 6: Integration & End-to-End Test (100K) [COMPLETED]
+### Steps 6.1-6.3 [COMPLETED]
+Full clean end-to-end run from `terraform apply` to live site at everytab.site.
+**Pipeline timing (100K hosts):**
+| Stage | Duration |
+|-------|----------|
+| CC-Index query | 13m11s |
+| WARC parsing | 4m55s |
+| Icon download | 10m39s |
+| Best icon selection | instant |
+| Bundle generation | 1m32s |
+| Frontend deploy | seconds |
+| **Total pipeline** | **~31 minutes** |
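As a quick sanity check on the total row, the four timed stages in the table can be summed. This is a minimal Python sketch rather than part of the pipeline: the durations are copied from the table, the "instant" and "seconds" stages are left out, and the `to_seconds` helper is a hypothetical name introduced only for illustration.

```python
import re

# Durations copied from the timing table; "instant" and "seconds" stages are omitted.
STAGES = {
    "CC-Index query": "13m11s",
    "WARC parsing": "4m55s",
    "Icon download": "10m39s",
    "Bundle generation": "1m32s",
}

def to_seconds(duration: str) -> int:
    """Parse a string like '13m11s' into seconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)m(\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds

total = sum(to_seconds(d) for d in STAGES.values())
print(f"{total // 60}m{total % 60:02d}s")
```

The timed stages sum to 30m17s, so the reported ~31 minutes leaves well under a minute for icon selection and the frontend deploy.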
-### Step 6.2: Tune Parameters
+**Loss funnel:**
+```
+100,000 hosts
+→ 93,432 with titles (6.6% loss)
+→ 70,551 with icons selected (24.4% loss)
+→ 69,306 with icons in bundles (1.8% convert errors)
+→ 779 bundles, 165MB total, avg 217KB per bundle
+```
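The funnel percentages can be reproduced from the raw counts. The sketch below is illustrative Python, not pipeline code: the counts are copied from the funnel, `step_losses` is a hypothetical helper, and the average bundle size assumes 1 MB = 1024 KB, which is an assumption about how the 217KB figure was derived.

```python
# Counts copied from the loss funnel.
funnel = [
    ("hosts", 100_000),
    ("with titles", 93_432),
    ("with icons selected", 70_551),
    ("with icons in bundles", 69_306),
]

def step_losses(stages):
    """Return (label, percent lost relative to the previous stage) pairs."""
    losses = []
    for (_, prev), (label, count) in zip(stages, stages[1:]):
        losses.append((label, 100 * (prev - count) / prev))
    return losses

losses = step_losses(funnel)
for label, pct in losses:
    print(f"{label}: {pct:.2f}% loss")

# Average bundle size, assuming 1 MB = 1024 KB (assumption, see lead-in).
avg_kb = 165 * 1024 / 779
print(f"avg bundle: {avg_kb:.0f} KB")
```

Each loss is measured against the previous stage, not against the original 100,000 hosts. The middle step works out to about 24.49%, so the recorded 24.4% appears to be truncated rather than rounded; the other figures match to one decimal place.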
-Based on the 100K run:
-- **ENTRIES_PER_BUNDLE:** Look at the live site. Does one bundle fill the screen? Too many tabs? Too few? Adjust.
-- **Concurrency:** Was the icon download memory-stable? CPU-bound or network-bound? Adjust goroutine pool size.
-- **Timeouts:** What was the error distribution? Are timeouts too aggressive? Too lenient?
-- **Icon selection:** Do the selected icons look good? Any weird sizes or broken images?
-Update CLI flag defaults based on findings.
-### Step 6.3: Collect & Review Stats
-Merge all `stats/*.json` into a single pipeline report. Review:
-- Loss at each stage (domains → parsed → icons downloaded → icons selected → bundled)
-- Time per stage
-- Error patterns (are certain TLDs failing more? certain icon formats?)
-- Storage usage (S3 icons bucket, S3 site bucket)
-Identify any pipeline bugs or data quality issues. Fix before scaling up.
-**Done when:** End-to-end works at 100K, parameters tuned, stats reviewed, bugs fixed.
+**Parameter review:**
+- `ENTRIES_PER_BUNDLE = 120`: fills the screen well, kept as-is
+- Icon download concurrency 200: I/O bound at 350 icons/sec, increasing doesn't help
+- Timeouts 10s: good balance, 2,261 timeouts (1%) is acceptable
+- Icons look good on the live site
+**Infrastructure notes:**
+- c5.xlarge (8GB) needs 4GB swap for the DuckDB query; gets OOM killed without it
+- DuckDB occasionally gets S3 503 (rate limit); retry works
+- `force_destroy = true` on icons bucket needed for clean teardown
+- deploy.sh sed pattern must match `[0-9]*` not `.*` to avoid eating `</script>`
 ---
@@ -545,6 +543,26 @@ On completion, each program prints a summary line and writes its stats JSON (wit
 ## Future Improvements
+### Phase 6: Completed 2026-05-18
+**Changes from original plan:**
+- Added 4GB swap file to EC2 bootstrap: DuckDB gets OOM-killed without it on c5.xlarge.
+- Added `force_destroy = true` to icons S3 bucket: terraform teardown fails otherwise when the bucket has objects.
+- Added pipeline README with full sanity checks between stages.
+- Deploy script (`pipeline/06_frontend/deploy.sh`) automates frontend upload + CloudFront invalidation.
+- CloudFront + ACM certificate + S3 bucket policy added to Terraform. Domain setup (Gandi ALIAS record) is a one-time manual step.
+**Lessons learned:**
+- Pipeline is reproducible: second clean run produced nearly identical numbers to the first.
+- DuckDB gets S3 503 errors occasionally (rate limiting). Retry works. May need `SET threads = 4` or retry logic for the full 30M run.
+- ACM certificate validation is a chicken-and-egg problem with CloudFront: use the `aws_acm_certificate_validation` resource to make Terraform wait for DNS validation before creating the distribution.
+- deploy.sh sed must match `[0-9]*` not `.*`: the greedy match eats the closing `</script>` tag.
+- Total wall-clock from `terraform apply` to live site: ~45 minutes (including bootstrap).
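The greedy-match lesson is easy to reproduce in isolation. The sketch below uses Python's `re` module as a stand-in for sed, since both treat `.*` greedily. The HTML line, the `?v=` cache-busting parameter, and the replacement value are all hypothetical; the actual pattern in `deploy.sh` is not shown in the plan.

```python
import re

# Hypothetical HTML line; the real deploy.sh target is not shown in the plan.
line = '<script src="site.js?v=123"></script>'

# Greedy `.*` runs to the end of the line and swallows `"></script>`.
greedy = re.sub(r"site\.js\?v=.*", "site.js?v=456", line)

# `[0-9]*` stops at the first non-digit, so the rest of the tag survives.
safe = re.sub(r"site\.js\?v=[0-9]*", "site.js?v=456", line)

print(greedy)
print(safe)
```

Restricting the match to a digit class keeps the substitution from reaching past the version number, which is the behavior the lesson describes.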
+---
+## Future Improvements
 ### Pipeline
 - **WARC parser: retry on fetch errors.** Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
 - **WARC parser: batch DB inserts.** Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
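The retry-with-backoff idea mentioned for transient S3 errors might look roughly like this. It is a sketch in Python for brevity, not the pipeline's implementation (the goroutine pool and pgx references suggest the real tools are written in Go). `TransientError`, `with_retry`, and the default delays are all hypothetical names and values.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as an S3 503."""

def with_retry(operation, retries=1, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientError, wait and retry up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff with jitter to avoid synchronized retries.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)

# Usage: a fetch that fails once, then succeeds on the retry.
calls = []

def flaky_fetch():
    calls.append(1)
    if len(calls) < 2:
        raise TransientError("503 Slow Down")
    return "ok"

result = with_retry(flaky_fetch, retries=1, sleep=lambda _: None)
print(result, len(calls))
```

Passing `sleep` as a parameter keeps the helper testable without real delays. With `retries=1` it matches the single retry suggested above.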