updated plan.md after 3M test
parent 8dce702e8d
commit b419b5bf6c
1 changed files with 102 additions and 54 deletions
156
PLAN.md
|
|
@ -494,16 +494,31 @@ On completion, each program prints a summary line and writes its stats JSON (wit
|
|||
|
||||
---
|
||||
|
||||
## Phase 7: 3M Scale Test
|
||||
## Phase 7: 3M Scale Test [COMPLETED]
|
||||
|
||||
Validates disk-based icon storage at scale and gets real timing estimates.
|
||||
Validated disk-based icon storage, performance tuning, and full pipeline at 3M scale.
|
||||
|
||||
- Tear down current infra, bring up fresh (1TB EBS)
|
||||
- Run full pipeline with `--limit 3000000`
|
||||
- Icon downloader writes to local disk (`--icons-dir icons`) instead of S3
|
||||
- Downloads ALL icons (no size filter) — full archive for posterity
|
||||
- Bundle gen reads from local disk (`--icons-dir icons`)
|
||||
- **Watch for:** disk I/O, DuckDB LIMIT behavior, any OOM issues
|
||||
**Final 3M pipeline results:**
|
||||
|
||||
| Stage | Duration | Result |
|
||||
|-------|----------|--------|
|
||||
| CC-Index query | ~13min | 3M hosts |
|
||||
| WARC parsing | ~3hrs (concurrency 50) | 2.8M titles, 6M icons |
|
||||
| Icon download | 4h21m (408/s) | 4.5M completed, 53GB, 70% success |
|
||||
| Best icon selection | instant | 2M hosts with icons |
|
||||
| Bundle generation | 1h23m (540 hosts/sec) | 22,429 bundles, 4.7GB |
|
||||
| Frontend deploy | seconds | Live at everytab.site |
|
||||
| **Total** | **~9 hours** | |
|
||||
|
||||
**Key changes during this phase:**
|
||||
- Icons stored on local disk (sharded `ab/cd/ef/hash`), not S3 — saves ~$175 in PUT costs
|
||||
- Removed icon size filter — downloads ALL icons for archival, filters at bundle gen time
|
||||
- Dropped `ORDER BY md5(id::text)` from icon claim query — was causing 30-second burst/stall cycles at 3M scale
|
||||
- Icon download batch size 200 → 5000, channel buffer = batch size
|
||||
- Bundle gen rewritten to stream: paginated DB reads, incremental bundle writes (fixed OOM at 3M)
|
||||
- `random_order` column on hosts table for shuffled bundles
|
||||
- EBS volume sized at 1TB for full icon archive
|
||||
- Added COSTS.md with monthly cost breakdown (~$42/month ongoing)
|
||||
|
||||
### Icon selection strategy (TODO: decide before Phase 8)
|
||||
|
||||
|
|
@ -518,66 +533,98 @@ This works but questions remain:
|
|||
- Should we have different strategies for different icon sources? (e.g., always use link_rel PNG 32x32 if available, fall back to favicon.ico)
|
||||
- At 30M scale, how much do large icons bloat total bundle size? Need data from the 3M run to decide.
|
||||
|
||||
## Phase 7.2: Performance Fixes + 3M Re-test
|
||||
## Phase 7.2: Code Review + Performance Fixes [COMPLETED]
|
||||
|
||||
Code review and performance improvements before the full 30M run. Make changes, review, then re-run the full 3M pipeline to validate.
|
||||
Adversarial code review followed by performance improvements, validated with a 300K run.
|
||||
|
||||
### Bundle gen redesign: streaming pipeline
|
||||
### Code review findings and fixes
|
||||
- **float32 pagination bug** — `random_order REAL` on the hosts table causes ~1.5% data loss at 30M scale: float32 values collide, and keyset pagination skips rows that share a key. Fixed: `DOUBLE PRECISION` + `float64` in Go.
|
||||
- **Protocol missing from bundles** — bundle JSON had `host` but no protocol. HTTP-only sites (23%) loaded as `https://` and broke. Fixed: replaced `host` field with `url` (full URL built on Go side).
|
||||
- **Non-atomic bundle deployment** — bundle gen deleted all S3 bundles before writing new ones. Crash mid-write = broken live site. Fixed: overwrite in-place, deploy.sh cleans up stale bundles after cache invalidation.
|
||||
- **ARCHITECTURE.md stale** — still described S3 icon storage, old claim query, old bundle format. Updated throughout to match current code.
|
||||
- **Dead go.mod dependencies** — progressbar and transitive deps removed. Direct vs indirect annotations fixed via `go mod tidy`.
|
||||
- **Shadowed builtins** — custom `min()`/`max()` functions removed (Go 1.21+ builtins).
|
||||
- **BMP decoder missing** — standalone BMP favicons passed download but failed in bundle gen. Added `golang.org/x/image/bmp` import.
|
||||
- **Frontend memory leak** — `loadedIcons` array grew unboundedly. Capped at 100 entries.
|
||||
- **Iframe stats inflated** — hosts that errored were counted as "iframe blocked" because the bool kept its zero value. Fixed: only successfully parsed hosts are counted.
|
||||
- **CSP check incomplete** — only checked first `Content-Security-Policy` header. Fixed to check all headers via `headers.Values()`.
|
||||
- **DNS error classification** — direct type assertion `err.(*net.DNSError)` never matched wrapped errors. Fixed with `errors.As()`.
|
||||
- **Icon download host hammering** — adjacent same-host icons in batches caused simultaneous requests. Fixed: Fisher-Yates shuffle of each batch before feeding to workers.
|
||||
- **Max icons per host** — capped at 50 link_rel icons per host in HTML parser to prevent adversarial pages from bloating the DB.
|
||||
- **`downloaded_at` column** — added to icons table for data freshness tracking.
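The DNS error classification fix above can be illustrated with a minimal sketch. The function names and the wrapped error are hypothetical; only `errors.As` and `net.DNSError` come from the standard library. It shows why a direct type assertion misses a wrapped error while `errors.As` finds it:

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isDNSErrorAssert is the old approach: a direct type assertion only
// matches when err itself is a *net.DNSError, not when it is wrapped.
func isDNSErrorAssert(err error) bool {
	_, ok := err.(*net.DNSError)
	return ok
}

// isDNSError walks the wrap chain with errors.As, so it also matches
// a *net.DNSError buried inside other errors.
func isDNSError(err error) bool {
	var dnsErr *net.DNSError
	return errors.As(err, &dnsErr)
}

// wrappedDNSError builds a hypothetical lookup failure wrapped with %w,
// standing in for the layered errors an HTTP client returns.
func wrappedDNSError() error {
	inner := &net.DNSError{Err: "no such host", Name: "example.invalid", IsNotFound: true}
	return fmt.Errorf("fetch icon: %w", inner)
}

func main() {
	err := wrappedDNSError()
	fmt.Println("type assertion:", isDNSErrorAssert(err))
	fmt.Println("errors.As:     ", isDNSError(err))
}
```

The same pattern applies to any error type that needs classifying after it has passed through wrapping layers.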
|
||||
|
||||
Current architecture (batch-convert-then-write):
|
||||
### Pipeline performance redesign
|
||||
|
||||
**WARC parser** — three-stage pipeline:
|
||||
```
|
||||
[fetch 6000 hosts] → [convert all 6000 icons] → [write bundles] → [fetch next 6000]
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
|
||||
all cores busy cores idle
|
||||
[DB fetcher] → hostCh → [500 workers] → resultCh → [DB writer with pgx.Batch]
|
||||
```
|
||||
- Channel-based worker pool (500 goroutines, up from 100 semaphore)
|
||||
- S3 retry with 6 attempts (AWS SDK `retry.AddWithMaxAttempts`)
|
||||
- Batched DB writes via `pgx.Batch` (100 results = ~400 queries per round-trip)
|
||||
- Result: 566 hosts/sec (1.6x improvement over 352/sec)
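A rough sketch of the three-stage shape described above, not the actual parser code. The `result` type, `runPipeline`, and the in-memory "flush" are assumptions for illustration; a real writer would send one `pgx.Batch` per flush instead of recording the batch size:

```go
package main

import (
	"fmt"
	"sync"
)

// result is a hypothetical stand-in for one parsed host.
type result struct {
	hostID int
	title  string
}

// runPipeline wires fetcher -> hostCh -> workers -> resultCh -> writer
// and returns the size of every batch the writer flushed.
func runPipeline(hostIDs []int, workers, batchSize int) []int {
	hostCh := make(chan int, batchSize)
	resultCh := make(chan result, batchSize)

	// Stage 1: the fetcher feeds host IDs (a real one would page through the DB).
	go func() {
		defer close(hostCh)
		for _, id := range hostIDs {
			hostCh <- id
		}
	}()

	// Stage 2: worker pool; each worker would fetch and parse a WARC record.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range hostCh {
				resultCh <- result{hostID: id, title: fmt.Sprintf("title-%d", id)}
			}
		}()
	}
	go func() {
		wg.Wait()
		close(resultCh) // no more results once every worker has exited
	}()

	// Stage 3: a single writer flushes every batchSize results.
	var flushed []int
	batch := make([]result, 0, batchSize)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		flushed = append(flushed, len(batch)) // real code: one batched round-trip here
		batch = batch[:0]
	}
	for r := range resultCh {
		batch = append(batch, r)
		if len(batch) == batchSize {
			flush()
		}
	}
	flush() // partial final batch
	return flushed
}

func main() {
	ids := make([]int, 250)
	for i := range ids {
		ids[i] = i
	}
	fmt.Println(runPipeline(ids, 8, 100))
}
```

Closing each channel from its producer side lets every stage drain and exit without extra signalling, and the final `flush()` covers the partial last batch.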
|
||||
|
||||
Target architecture (fully pipelined):
|
||||
**Bundle gen** — four-stage pipeline:
|
||||
```
|
||||
[DB fetcher] → channel → [N converter workers] → channel → [bundle writer]
|
||||
always always busy writes as soon as
|
||||
prefetching 120 entries ready
|
||||
[DB fetcher] → hostCh → [20 converters] → entryCh → [assembler] → uploadCh → [10 uploaders]
|
||||
```
|
||||
- Converters default 20 (CPU-bound, ~5x core count on c5.xlarge)
|
||||
- Separate upload workers for S3 PUT parallelism
|
||||
- Result: 2,377 hosts/sec (4.4x improvement over 540/sec)
|
||||
|
||||
Changes:
|
||||
- **DB fetcher goroutine** — continuously fetches pages, feeds host channel. Same pattern as icon downloader.
|
||||
- **Converter workers** (200 goroutines) — read from host channel, read icon from disk, decode, re-encode PNG, base64, send BundleEntry to output channel.
|
||||
- **Bundle writer goroutine** — collects entries from output channel into a buffer. Every 120 entries, serialize JSON and upload to S3. Runs concurrently with conversion.
|
||||
**Icon download** — in-memory batch shuffle added, concurrency bumped to 1000.
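The in-memory batch shuffle can be sketched as a plain Fisher-Yates pass. The `iconJob` type and the seed parameter are assumptions made for this example; the standard library's `rand.Shuffle` would do the same job:

```go
package main

import (
	"fmt"
	"math/rand"
)

// iconJob is a hypothetical unit of work claimed from the icons table.
type iconJob struct {
	Host string
	URL  string
}

// shuffleBatch performs an in-place Fisher-Yates shuffle so that icons
// from the same host are no longer adjacent when fed to the workers.
// The seed is a parameter only to keep this sketch reproducible.
func shuffleBatch(batch []iconJob, seed int64) {
	rng := rand.New(rand.NewSource(seed))
	for i := len(batch) - 1; i > 0; i-- {
		j := rng.Intn(i + 1) // pick j with 0 <= j <= i
		batch[i], batch[j] = batch[j], batch[i]
	}
}

func main() {
	batch := []iconJob{
		{"a.example", "https://a.example/favicon.ico"},
		{"a.example", "https://a.example/icon-32.png"},
		{"b.example", "https://b.example/favicon.ico"},
		{"b.example", "https://b.example/icon-192.png"},
		{"c.example", "https://c.example/favicon.ico"},
	}
	shuffleBatch(batch, 1)
	for _, job := range batch {
		fmt.Println(job.Host, job.URL)
	}
}
```

A shuffle only spreads same-host requests within one batch; it does not enforce a per-host rate limit.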
|
||||
|
||||
This eliminates all synchronization points — DB, disk, CPU, and S3 are all utilized simultaneously. No fits and starts.
|
||||
**CC-Index query** — downloads parquet files locally first (`aws s3 sync`), then queries with DuckDB. Eliminates S3 503 rate-limit failures.
|
||||
|
||||
### Other fixes to include
|
||||
- **WARC parser: add S3 retry with backoff** — currently concurrency 50 to avoid 503s. With retry, can go back to 100+ and handle transient 503s gracefully.
|
||||
- **Icon download: reduce timeout to 3-5s** — most legitimate servers respond in <1s. Dead hosts block a worker for 10s currently.
|
||||
- **Icon download: confirm all icons downloaded** — size filter already removed from claim query, but verify at 3M that all link_rel icons (including large declared sizes) are being downloaded.
|
||||
- **Icon selection strategy** — decide on final criteria (see Phase 7 notes above) and validate with 3M data.
|
||||
**Best icon selection** — new priority: target 32x32 for Retina display. Pick smallest icon ≥32px, fall back to largest <32px. No more "standard sizes" tiers.
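A minimal sketch of that selection rule, assuming square icons with a known pixel size. The `icon` type and `pickBestIcon` are hypothetical names, and the real selection may run in SQL rather than Go:

```go
package main

import "fmt"

// icon is a hypothetical candidate; Size is the pixel width of a square icon.
type icon struct {
	URL  string
	Size int
}

const targetSize = 32

// pickBestIcon returns the smallest icon that is at least targetSize,
// falling back to the largest icon below targetSize. ok is false when
// there are no candidates.
func pickBestIcon(icons []icon) (best icon, ok bool) {
	var atLeast, below *icon
	for i := range icons {
		c := &icons[i]
		if c.Size >= targetSize {
			if atLeast == nil || c.Size < atLeast.Size {
				atLeast = c
			}
		} else if below == nil || c.Size > below.Size {
			below = c
		}
	}
	switch {
	case atLeast != nil:
		return *atLeast, true
	case below != nil:
		return *below, true
	}
	return icon{}, false
}

func main() {
	candidates := []icon{
		{"/favicon.ico", 16},
		{"/icon-192.png", 192},
		{"/icon-48.png", 48},
	}
	if best, ok := pickBestIcon(candidates); ok {
		fmt.Println(best.URL, best.Size)
	}
}
```

With the candidates in `main`, the 48px icon wins because it is the smallest one at or above 32px.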
|
||||
|
||||
### Code review
|
||||
Before running the 3M re-test, do a full read-through of all Go code:
|
||||
- Check all error handling paths — are errors logged, counted, and surfaced?
|
||||
- Check concurrency patterns — any race conditions, deadlocks, goroutine leaks?
|
||||
- Check resource cleanup — are DB connections, file handles, HTTP responses closed?
|
||||
- Check the new streaming bundle gen — does it handle edge cases (empty pages, partial final bundle, S3 upload failures)?
|
||||
- Review all CLI flag defaults — are they tuned for the 30M run?
|
||||
### Stats improvements
|
||||
- WARC parser: added `no_title` counter, fixed `icons_found` to include favicon.ico
|
||||
- Best icon selection: now writes `stats/04_best_icon.json`
|
||||
- Bundle gen: added `bundled_with_icon` / `bundled_no_icon` counters (distinguishes "never had icon" from "convert error")
|
||||
|
||||
### Validation
|
||||
- Run full 3M pipeline from scratch with all fixes
|
||||
- Compare timings against Phase 7 run
|
||||
- Confirm bundle gen saturates CPU and doesn't stall
|
||||
- Confirm icon download rate improves with shorter timeout
|
||||
- Confirm WARC parsing can run at higher concurrency with retry
|
||||
- Confirm all icon types (including large) are downloaded
|
||||
- Review live site at everytab.site with 3M data
|
||||
### 300K validation run results
|
||||
|
||||
| Stage | Duration | Rate |
|
||||
|-------|----------|------|
|
||||
| CC-Index query | 83s | — |
|
||||
| WARC parsing | 8m50s | 566 hosts/sec |
|
||||
| Icon download | 34m45s | 439 icons/sec |
|
||||
| Best icon selection | instant | — |
|
||||
| Bundle generation | 1m59s | 2,377 hosts/sec |
|
||||
| Frontend deploy | seconds | — |
|
||||
| **Total** | **~47 min** | |
|
||||
|
||||
**Loss funnel:**
|
||||
```
|
||||
300,000 hosts from CC-Index
|
||||
→ 282,854 with titles (94.3%)
|
||||
→ 213,656 bundled with icon (75.5% of titled)
|
||||
→ 69,198 bundled without icon (24.5%)
|
||||
→ 68,793 never had an icon
|
||||
→ 405 icon convert errors
|
||||
→ 2,358 bundles, 603MB total
|
||||
```
|
||||
|
||||
## Phase 8: 30M Full Run (Single Machine)
|
||||
|
||||
Full internet scan on one c5.xlarge.
|
||||
|
||||
- `--limit 0` for CC-Index query
|
||||
- **Expected:** ~3-4 days total (WARC parsing ~25hrs, icon download ~50hrs)
|
||||
- **Expected (extrapolated from 300K run):**
|
||||
- CC-Index: ~10min (download) + ~15min (query) — possibly much longer at 30M due to swap thrashing
|
||||
- WARC parsing: ~14-15hrs (566 hosts/sec)
|
||||
- Icon download: ~50-60hrs (439 icons/sec at 1000 concurrency, the long pole — 2500 concurrency may improve this)
|
||||
- Bundle gen: ~3.5hrs (2,377 hosts/sec)
|
||||
- **Total: ~3 days**
|
||||
- Run in tmux, monitor with `psql` queries from another session
|
||||
- **Expected disk:** ~500GB-1TB for all icons (full archive)
|
||||
- **Cost:** ~$50 (EC2 + RDS + 1TB EBS for 4 days)
|
||||
- **Expected disk:** ~650GB for all icons (6.5GB per 300K × 100)
|
||||
- **Cost:** ~$50 (EC2 + RDS + 1TB EBS for 3-4 days)
|
||||
- After completion: deploy frontend, verify live site, backup icons + DB to homelab via rsync
|
||||
- **Stuck icon recovery** (if icon download crashes): `UPDATE icons SET scan_state = 'unscanned' WHERE scan_state = 'in_progress';`
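The extrapolated durations above can be reproduced with a short calculation. It assumes 30,000,000 hosts (100× the 300K run) and that the measured rates hold linearly at scale, which the notes above already flag as uncertain for the CC-Index query:

```go
package main

import (
	"fmt"
	"time"
)

// estimate returns how long it takes to process items at ratePerSec,
// assuming the rate stays constant.
func estimate(items, ratePerSec float64) time.Duration {
	return time.Duration(items / ratePerSec * float64(time.Second))
}

func main() {
	const hosts = 30_000_000.0 // assumed: 100x the 300K validation run

	// Icons attempted in the 300K run: 34m45s at 439 icons/sec, scaled by 100.
	icons300K := (34*time.Minute + 45*time.Second).Seconds() * 439
	icons := icons300K * 100

	fmt.Printf("WARC parsing:  %.1f h\n", estimate(hosts, 566).Hours())
	fmt.Printf("Icon download: %.1f h\n", estimate(icons, 439).Hours())
	fmt.Printf("Bundle gen:    %.1f h\n", estimate(hosts, 2377).Hours())
}
```

This prints roughly 14.7 h, 57.9 h, and 3.5 h, in line with the ~3 day total once the CC-Index stage is added.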
|
||||
|
||||
### Consider c5.2xlarge for future runs
|
||||
The CC-Index DuckDB query is memory-bound — at 30M the GROUP BY hash table exceeds 8GB and swap thrashing dominates query time. c5.2xlarge (16GB, 8 vCPUs) would eliminate swap pressure entirely and double CPU cores for bundle gen. Cost difference: $0.17/hr → $0.34/hr, but if it halves the CC-Index query time and speeds up bundle gen (CPU-bound), the total EC2 hours may decrease enough to break even. Also benefits WARC parsing (more headroom for 500+ goroutines) and icon download (more memory for 5000 concurrent connections). Worth testing on a future run.
|
||||
|
||||
## Phase 9: Frontend Polish
|
||||
|
||||
|
|
@ -621,17 +668,18 @@ Monthly pipeline triggered by new Common Crawl release.
|
|||
## Future Improvements (Non-Blocking)
|
||||
|
||||
### Pipeline
|
||||
- **WARC parser: retry on fetch errors** — add 1 retry with backoff for transient S3 errors
|
||||
- **WARC parser: batch DB inserts** — pgx batch or CopyFrom for better write throughput
|
||||
- **CC-Index query: streaming dedup** — current GROUP BY builds a ~30M-row hash table in memory, causing severe swap thrashing on c5.xlarge (8GB). Options: (1) use `INSERT ... ON CONFLICT (hostname) DO UPDATE` to stream rows into Postgres and let the UNIQUE constraint dedup, eliminating the hash table entirely; (2) process parquet files in smaller batches, dedup per-batch in DuckDB, final dedup in Postgres; (3) just use c5.2xlarge (16GB) to fit the hash table in RAM. Current workaround: `SET temp_directory` to let DuckDB spill to EBS instead of OS swap.
|
||||
- **Encoding: remaining garbled titles** — more aggressive charset detection heuristics
|
||||
- **Icon download: retry transient failures** — single retry for DNS/timeout errors
|
||||
- **Bundle gen: SVG rasterization** — recover ~1,077 hosts with SVG-only favicons
|
||||
- **Bundle gen: SVG rasterization** — shell out to `rsvg-convert` for SVG-only hosts (~3.5% of icons)
|
||||
- **Bundle gen: bilinear downscaling** — better quality than nearest-neighbor for >128px icons
|
||||
|
||||
### Frontend
|
||||
- **Cross-browser tab styling** — match real browser tabs more closely
|
||||
- **Mobile layout** — responsive tab sizing, touch-friendly interaction
|
||||
- **Stats page** — pipeline stats rendered on the site
|
||||
- **Stats page / Sankey diagram** — pipeline loss funnel rendered on the site
|
||||
|
||||
### Icon Download Ordering
|
||||
- **Verify icon row ordering at 30M scale** — WARC parser inserts ~2-5 icons per host in sequence (favicon_ico + link_rel entries). Without ORDER BY, Postgres returns rows roughly in insertion order, so icons from the same host are adjacent. At 100K-3M this didn't cause problems (batch size 5000 means each batch has icons from ~2,000+ different hosts). At 30M, confirm with `iftop` that we're not hammering individual hosts. If needed, add `random_order REAL DEFAULT random()` column to icons table and use it in the claim query — but don't index it (60M+ writes).
|
||||
### Additional Metadata
|
||||
- **`http_status INT` on icons** — structured HTTP status code (currently only stored as error text). Enables analysis like "404 (site moved) vs 403 (bot blocked) vs 500 (server error)".
|
||||
- **`response_time_ms INT` on icons** — server response latency. Useful for tuning timeouts, identifying slow hosts, health signal.
|
||||
- **`parsed_at TIMESTAMPTZ` on hosts** — when the WARC was parsed. Currently only a `parsed` boolean.
|
||||
- **`created_at TIMESTAMPTZ DEFAULT now()` on hosts** — when the host entered the pipeline.
|
||||
|
|
|
|||