updated plan.md, finished integration test
parent
5b3f6a6870
commit
c7e33defa2
1 changed file with 51 additions and 33 deletions
84
PLAN.md
|
|
@@ -246,45 +246,43 @@ Files: `frontend/index.html` and `frontend/site.js`
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Phase 6: Integration & End-to-End Test (100K)
|
## Phase 6: Integration & End-to-End Test (100K) [COMPLETED]
|
||||||
|
|
||||||
### Step 6.1: Run Full Pipeline (100K)
|
### Steps 6.1-6.3 [COMPLETED]
|
||||||
|
|
||||||
Execute all stages in sequence on EC2:
|
Full clean end-to-end run from `terraform apply` to live site at everytab.site.
|
||||||
1. Verify hosts table has 100K entries (from Phase 1)
|
|
||||||
2. Run WARC parser (Phase 2) — should complete in minutes
|
|
||||||
3. Run icon downloader (Phase 3) — should complete in 10-30 minutes at 100K scale
|
|
||||||
4. Run best icon selection (Phase 4.1)
|
|
||||||
5. Run bundle generator (Phase 4.2-4.4)
|
|
||||||
6. Run frontend build (Phase 5.6)
|
|
||||||
|
|
||||||
**Validation:** Visit the CloudFront URL. The site should work:
|
**Pipeline timing (100K hosts):**
|
||||||
- Tabs render with real favicons and titles
|
| Stage | Duration |
|
||||||
- Clicking works (iframe + external)
|
|-------|----------|
|
||||||
- Scrolling loads more tabs
|
| CC-Index query | 13m11s |
|
||||||
- No JS console errors
|
| WARC parsing | 4m55s |
|
||||||
|
| Icon download | 10m39s |
|
||||||
|
| Best icon selection | instant |
|
||||||
|
| Bundle generation | 1m32s |
|
||||||
|
| Frontend deploy | seconds |
|
||||||
|
| **Total pipeline** | **~31 minutes** |
|
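As a quick cross-check of the timing table, the four measured stage durations can be summed. This is a minimal sketch in Python; the `to_seconds` helper is hypothetical and only handles the `NmSSs` format used in the table. The stages listed as "instant" and "seconds" are left out, which is consistent with the measured sum landing a little under the ~31 minute total.

```python
import re

# Measured stage durations copied from the timing table.
durations = {
    "CC-Index query": "13m11s",
    "WARC parsing": "4m55s",
    "Icon download": "10m39s",
    "Bundle generation": "1m32s",
}


def to_seconds(text):
    """Convert a duration like '13m11s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unexpected duration format: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


total_seconds = sum(to_seconds(d) for d in durations.values())
formatted_total = f"{total_seconds // 60}m{total_seconds % 60:02d}s"
print(formatted_total)  # 30m17s
```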
||||||
|
|
||||||
### Step 6.2: Tune Parameters
|
**Loss funnel:**
|
||||||
|
```
|
||||||
|
100,000 hosts
|
||||||
|
→ 93,432 with titles (6.6% loss)
|
||||||
|
→ 70,551 with icons selected (24.4% loss)
|
||||||
|
→ 69,306 with icons in bundles (1.8% convert errors)
|
||||||
|
→ 779 bundles, 165MB total, avg 217KB per bundle
|
||||||
|
```
|
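The funnel percentages are each relative to the previous stage, not to the original 100,000 hosts. A small sketch that recomputes them from the counts above is shown here; the function name `step_losses` is illustrative. The middle step comes out at about 24.49%, so the 24.4% in the funnel looks truncated rather than rounded. The bundle count is also consistent with 120 entries per bundle applied to the 93,432 titled hosts, although that is an inference from the numbers and not something the plan states.

```python
# Counts copied from the loss funnel.
stages = [
    ("hosts", 100_000),
    ("with titles", 93_432),
    ("with icons selected", 70_551),
    ("with icons in bundles", 69_306),
]


def step_losses(stages):
    """Return (stage name, percent lost relative to the previous stage)."""
    losses = []
    for (_, prev_count), (name, count) in zip(stages, stages[1:]):
        losses.append((name, 100 * (prev_count - count) / prev_count))
    return losses


losses = step_losses(stages)
for name, pct in losses:
    print(f"{name}: {pct:.2f}% loss")

# Ceiling division: bundles needed for the titled hosts at 120 per bundle.
# Assumption: bundles are built from titled hosts, with or without an icon.
bundles = -(-93_432 // 120)
print(bundles)
```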
||||||
|
|
||||||
Based on the 100K run:
|
**Parameter review:**
|
||||||
- **ENTRIES_PER_BUNDLE:** Look at the live site. Does one bundle fill the screen? Too many tabs? Too few? Adjust.
|
- `ENTRIES_PER_BUNDLE = 120` — fills the screen well, kept as-is
|
||||||
- **Concurrency:** Was the icon download memory-stable? CPU-bound or network-bound? Adjust goroutine pool size.
|
- Icon download concurrency 200 — I/O bound at 350 icons/sec, increasing doesn't help
|
||||||
- **Timeouts:** What was the error distribution? Are timeouts too aggressive? Too lenient?
|
- Timeouts 10s — good balance, 2,261 timeouts (1%) is acceptable
|
||||||
- **Icon selection:** Do the selected icons look good? Any weird sizes or broken images?
|
- Icons look good on the live site
|
||||||
|
|
||||||
Update CLI flag defaults based on findings.
|
**Infrastructure notes:**
|
||||||
|
- c5.xlarge (8GB) needs 4GB swap for the DuckDB query — gets OOM killed without it
|
||||||
### Step 6.3: Collect & Review Stats
|
- DuckDB occasionally gets S3 503 (rate limit) — retry works
|
||||||
|
- `force_destroy = true` on icons bucket needed for clean teardown
|
||||||
Merge all `stats/*.json` into a single pipeline report. Review:
|
- deploy.sh sed pattern must match `[0-9]*` not `.*` to avoid eating `</script>`
|
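The sed note above can be illustrated with Python's `re` module, whose greedy matching behaves the same way here. This is only a sketch: the HTML line, the `?v=` cache-busting parameter, and the version values are assumptions, since the plan does not show the actual `deploy.sh` pattern or the markup it edits.

```python
import re

# Hypothetical line from index.html; the real markup is not shown in the plan.
line = '<script src="site.js?v=1715990000"></script>'
new_version = "1716076400"

# Greedy `.*` runs to the end of the line, swallowing the quote and closing tag.
greedy = re.sub(r"\?v=.*", "?v=" + new_version, line)

# `[0-9]*` stops at the first non-digit, so the rest of the tag survives.
digits_only = re.sub(r"\?v=[0-9]*", "?v=" + new_version, line)

print(greedy)
print(digits_only)
```

With the greedy pattern the output line loses its closing quote and `</script>`; with the digit-only pattern just the version number changes.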
||||||
- Loss at each stage (domains → parsed → icons downloaded → icons selected → bundled)
|
|
||||||
- Time per stage
|
|
||||||
- Error patterns (are certain TLDs failing more? certain icon formats?)
|
|
||||||
- Storage usage (S3 icons bucket, S3 site bucket)
|
|
||||||
|
|
||||||
Identify any pipeline bugs or data quality issues. Fix before scaling up.
|
|
||||||
|
|
||||||
**Done when:** End-to-end works at 100K, parameters tuned, stats reviewed, bugs fixed.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@@ -545,6 +543,26 @@ On completion, each program prints a summary line and writes its stats JSON (wit
|
||||||
|
|
||||||
## Future Improvements
|
## Future Improvements
|
||||||
|
|
||||||
|
### Phase 6 — Completed 2026-05-18
|
||||||
|
|
||||||
|
**Changes from original plan:**
|
||||||
|
- Added 4GB swap file to EC2 bootstrap; DuckDB gets OOM-killed without it on c5.xlarge.
|
||||||
|
- Added `force_destroy = true` to icons S3 bucket — terraform teardown fails otherwise when bucket has objects.
|
||||||
|
- Added pipeline README with sanity checks between stages.
|
||||||
|
- Deploy script (`pipeline/06_frontend/deploy.sh`) automates frontend upload + CloudFront invalidation.
|
||||||
|
- CloudFront + ACM certificate + S3 bucket policy added to Terraform. Domain setup (Gandi ALIAS record) is a one-time manual step.
|
||||||
|
|
||||||
|
**Lessons learned:**
|
||||||
|
- Pipeline is reproducible — second clean run produced nearly identical numbers to the first.
|
||||||
|
- DuckDB gets S3 503 errors occasionally (rate limiting). Retry works. May need `SET threads = 4` or retry logic for the full 30M run.
|
||||||
|
- ACM certificate validation is a chicken-and-egg problem with CloudFront: use the `aws_acm_certificate_validation` resource so Terraform waits for DNS validation before creating the distribution.
|
||||||
|
- deploy.sh sed must match `[0-9]*` not `.*` — the greedy match eats the closing `</script>` tag.
|
||||||
|
- Total wall-clock from `terraform apply` to live site: ~45 minutes (including bootstrap).
|
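The retry logic mentioned in the lessons above for S3 503 errors could look roughly like the following. This is a sketch under assumptions, not the pipeline's implementation: `TransientS3Error`, `with_retry`, and the delay values are hypothetical names and defaults, and the demo uses a fake query instead of a real DuckDB call.

```python
import random
import time


class TransientS3Error(Exception):
    """Hypothetical stand-in for an S3 503 surfaced by the query layer."""


def with_retry(fn, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on TransientS3Error with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientS3Error:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


# Demo: a fake query that fails twice with a 503, then succeeds.
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TransientS3Error("503 Slow Down")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = with_retry(flaky_query, sleep=delays.append)
print(result, len(delays))
```

Passing `sleep` as a parameter keeps the helper testable without real waits; in production the default `time.sleep` would apply.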
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Future Improvements
|
||||||
|
|
||||||
### Pipeline
|
### Pipeline
|
||||||
- **WARC parser: retry on fetch errors** — Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
|
- **WARC parser: retry on fetch errors** — Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
|
||||||
- **WARC parser: batch DB inserts** — Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
|
- **WARC parser: batch DB inserts** — Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
|
||||||
|
|
|
||||||