updated PLAN.md and ARCHITECTURE.md with new instance type and performance concerns
parent
b419b5bf6c
commit
baf657a8ed
2 changed files with 98 additions and 25 deletions
|
|
@ -66,8 +66,8 @@ All resources in **us-east-1**.
|
||||||
|
|
||||||
| Resource | Purpose | Lifecycle |
|
| Resource | Purpose | Lifecycle |
|
||||||
|----------|---------|-----------|
|
|----------|---------|-----------|
|
||||||
| EC2 (c5.xlarge) + 1TB EBS | Run all pipeline stages, store icon archive | Scanning only |
|
| EC2 (c5.2xlarge) + 1TB EBS | Run all pipeline stages, store icon archive | Scanning only |
|
||||||
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup to homelab, then delete) |
|
| RDS Postgres (db.m5.large) | Store hosts/icons metadata | Scanning only (backup to homelab, then delete) |
|
||||||
| S3 `everytab-site` | Static site: index.html, site.js, tabs/*.json | Permanent |
|
| S3 `everytab-site` | Static site: index.html, site.js, tabs/*.json | Permanent |
|
||||||
| CloudFront | CDN for static site (Brotli compression enabled) | Permanent |
|
| CloudFront | CDN for static site (Brotli compression enabled) | Permanent |
|
||||||
| S3 `everytab-logs` | CloudFront access logs | Permanent |
|
| S3 `everytab-logs` | CloudFront access logs | Permanent |
|
||||||
|
|
@ -117,6 +117,7 @@ Icons are stored on local disk during scanning, not S3. The EBS volume holds the
|
||||||
| s3_key | TEXT | SHA-256 hash of content (used as local file path, legacy column name) |
|
| s3_key | TEXT | SHA-256 hash of content (used as local file path, legacy column name) |
|
||||||
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
|
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
|
||||||
| error | TEXT | Error message if failed |
|
| error | TEXT | Error message if failed |
|
||||||
|
| downloaded_at | TIMESTAMPTZ | When the icon was fetched (NULL if not yet downloaded) |
|
||||||
|
|
||||||
**Indexes:**
|
**Indexes:**
|
||||||
- `CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned'` — partial index for work claiming. Only indexes unscanned rows; shrinks as work completes. Minimal write overhead since index only updates on transition OUT of 'unscanned'.
|
- `CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned'` — partial index for work claiming. Only indexes unscanned rows; shrinks as work completes. Minimal write overhead since index only updates on transition OUT of 'unscanned'.
|
||||||
|
|
@ -199,17 +200,23 @@ WHERE url_path = '/'
|
||||||
7. Insert all discovered `link rel="icon"` entries into `icons` (any format: ICO, PNG, GIF, SVG, WebP, JPEG)
|
7. Insert all discovered `link rel="icon"` entries into `icons` (any format: ICO, PNG, GIF, SVG, WebP, JPEG)
|
||||||
8. Update `hosts` row: html_title, iframe_allowed, parsed = TRUE
|
8. Update `hosts` row: html_title, iframe_allowed, parsed = TRUE
|
||||||
|
|
||||||
**Concurrency:** High — thousands of goroutines with a semaphore/pool. CC's S3 handles massive throughput.
|
**Architecture:** Three-stage pipeline:
|
||||||
|
|
||||||
**Error handling:** Malformed HTML → still extract what we can (partial title, partial icons). WARC fetch failure → log and skip (mark parsed = TRUE with NULL title to avoid retry loops). All errors logged with hostname for investigation.
|
```
|
||||||
|
[DB fetcher] → hostCh → [500 workers] → resultCh → [DB writer with pgx.Batch]
|
||||||
|
```
|
||||||
|
|
||||||
|
1. **DB fetcher** (1 goroutine): continuously pages through unparsed hosts (batch size 5000), feeds `hostCh`.
|
||||||
|
2. **Workers** (500 goroutines, configurable): fetch WARC from S3, parse HTML, update stats, send successful results to `resultCh`. I/O-bound on S3 latency.
|
||||||
|
3. **DB writer** (1 goroutine): collects results, flushes every 500 using `pgx.Batch` (~2000 queries per DB round-trip). Workers retry S3 fetches up to 6 times with exponential backoff for transient 503s.
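The three stages above can be sketched with plain channels. This is a minimal illustration, not the project's code: `result`, `runPipeline`, and the `parse` callback are hypothetical names, the S3 fetch and HTML parse are replaced by a stub, and the `pgx.Batch` flush is replaced by recording the batch size.

```go
package main

import (
	"fmt"
	"sync"
)

// result stands in for the parsed output of one host (hypothetical type).
type result struct {
	host  string
	title string
}

// runPipeline wires fetcher -> workers -> batching writer and returns the
// size of every batch the writer flushed. parse is a stand-in for the
// S3 fetch + HTML parse; a real writer would send a pgx.Batch per flush.
func runPipeline(hosts []string, workers, batchSize int, parse func(string) (result, bool)) []int {
	hostCh := make(chan string, 16)
	resultCh := make(chan result, 16)

	// Stage 1: a single fetcher feeds hostCh, then closes it.
	go func() {
		defer close(hostCh)
		for _, h := range hosts {
			hostCh <- h
		}
	}()

	// Stage 2: workers forward only successful results.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for h := range hostCh {
				if r, ok := parse(h); ok {
					resultCh <- r
				}
			}
		}()
	}
	go func() {
		wg.Wait()
		close(resultCh) // lets the writer loop terminate
	}()

	// Stage 3: single writer flushes every batchSize results, plus the remainder.
	var flushed []int
	batch := make([]result, 0, batchSize)
	for r := range resultCh {
		batch = append(batch, r)
		if len(batch) == batchSize {
			flushed = append(flushed, len(batch))
			batch = batch[:0]
		}
	}
	if len(batch) > 0 {
		flushed = append(flushed, len(batch))
	}
	return flushed
}

func main() {
	hosts := make([]string, 250)
	for i := range hosts {
		hosts[i] = fmt.Sprintf("host-%d.example", i)
	}
	parse := func(h string) (result, bool) { return result{host: h, title: "t"}, true }
	fmt.Println(runPipeline(hosts, 8, 100, parse)) // prints [100 100 50]
}
```

Because both channels are bounded, a slow writer eventually blocks the workers, which in turn block the fetcher. That is the back-pressure behaviour the rest of this document refers to.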
|
||||||
|
|
||||||
|
**Error handling:** Malformed HTML → still extract what we can (partial title, partial icons). WARC fetch failure → log and skip (host stays `parsed = FALSE`, retryable on next run). Max 50 link_rel icons per host (defensive cap against adversarial pages).
|
||||||
|
|
||||||
**Icon URL handling:** Relative URLs resolved against `{protocol}://{hostname}/`. Absolute URLs kept as-is. Data URIs ignored.
|
**Icon URL handling:** Relative URLs resolved against `{protocol}://{hostname}/`. Absolute URLs kept as-is. Data URIs ignored.
|
||||||
|
|
||||||
**No scan_state needed:** CC's S3 is highly reliable. The `parsed` boolean is sufficient. If the process crashes mid-batch, re-run picks up where it left off (unparsed rows).
|
|
||||||
|
|
||||||
**Cost:** $0 (same Open Data program).
|
**Cost:** $0 (same Open Data program).
|
||||||
|
|
||||||
**Stats emitted:** Rows processed, titles extracted, icons found (by source: favicon_ico vs link_rel), icon format distribution, iframe restrictions found, parse failures, rows with no title.
|
**Stats emitted:** Rows processed, titles extracted, no-title count, icons found, iframe restrictions, fetch/parse errors, DB errors, panics.
|
||||||
|
|
||||||
### Stage 3: Icon Download
|
### Stage 3: Icon Download
|
||||||
|
|
||||||
|
|
@ -247,7 +254,7 @@ WHERE url_path = '/'
|
||||||
- Update icons row: s3_key (the SHA-256 hash), content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
|
- Update icons row: s3_key (the SHA-256 hash), content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
|
||||||
- On failure: scan_state = 'failed', error = reason
|
- On failure: scan_state = 'failed', error = reason
|
||||||
|
|
||||||
**Concurrency:** Channel-based worker pool (default 200 workers, configurable). Producer goroutine feeds a buffered channel (buffer = batch size), N workers consume. No starvation between batch claims.
|
**Concurrency:** Channel-based worker pool (default 2500 workers, configurable). A producer goroutine shuffles each batch, to avoid hitting the same host back-to-back, and feeds a buffered channel (buffer = batch size). N workers consume from the channel.
|
||||||
|
|
||||||
**Fast failure strategy:**
|
**Fast failure strategy:**
|
||||||
- DNS failure → fail immediately (Unbound will cache NXDOMAIN)
|
- DNS failure → fail immediately (Unbound will cache NXDOMAIN)
|
||||||
|
|
@ -336,6 +343,54 @@ Bundles are written in-place (overwriting previous run). No delete-first step, s
|
||||||
3. **Verify backups:** confirm pg_dump restores cleanly on homelab, spot-check icon files
|
3. **Verify backups:** confirm pg_dump restores cleanly on homelab, spot-check icon files
|
||||||
4. Tear down scanning infra: `terraform apply -var="scanning=false"` (deletes RDS, EC2, icons S3 bucket)
|
4. Tear down scanning infra: `terraform apply -var="scanning=false"` (deletes RDS, EC2, icons S3 bucket)
|
||||||
|
|
||||||
|
## Performance Characteristics
|
||||||
|
|
||||||
|
Each pipeline stage has different bottlenecks. Understanding these explains the concurrency choices and why certain stages can't be sped up further on a single machine.
|
||||||
|
|
||||||
|
### Stage 1: CC-Index Query
|
||||||
|
- **Download phase: network-bound.** `aws s3 sync` of ~166GB of parquet files. Throughput limited by EC2 network bandwidth (10 Gbps on c5.2xlarge). Takes ~10-15 minutes.
|
||||||
|
- **Query phase: memory-bound.** DuckDB loads the GROUP BY hash table into memory. At 30M output rows, the hash table approaches 16GB. `temp_directory` is set to EBS so DuckDB spills to NVMe efficiently (large sequential I/O) rather than relying on OS swap (random 4KB page faults). On c5.2xlarge (16GB RAM) with 8GB swap, the query completes without severe thrashing.
|
||||||
|
- **Not CPU-bound** — DuckDB's columnar scan is efficient, CPU cores are underutilized during the query.
|
||||||
|
|
||||||
|
### Stage 2: WARC Parsing
|
||||||
|
- **CPU-bound + network I/O-bound (S3).** Each WARC fetch is a byte-range S3 GetObject request (~100-200ms round-trip), but TLS handshakes + gzip decompression + HTML parsing consume significant CPU. At 500 goroutines on 4 cores, CPU was at 100%. On c5.2xlarge (8 cores), more workers can actually compute simultaneously.
|
||||||
|
- **DB writes batched** via `pgx.Batch` — 500 results (~2000 queries) per round-trip. Non-burstable RDS (db.m5.large) provides consistent write performance. Burstable t3 instances throttle under sustained load and cause pipeline stalls via channel back-pressure.
|
||||||
|
- **Channel buffers sized to prevent stalls** — hostCh (20K) gives the DB fetcher enough runway between queries. resultCh (1K) absorbs write latency spikes.
|
||||||
|
- **S3 retry** with 6 attempts and exponential backoff handles transient 503s from the `commoncrawl` bucket.
|
||||||
|
- **Measured: 566 hosts/sec** at concurrency 500 on c5.xlarge (4 cores). Expected ~1000+ hosts/sec on c5.2xlarge (8 cores).
|
||||||
|
|
||||||
|
### Stage 3: Icon Download
|
||||||
|
- **Network I/O-bound (internet).** Downloading from millions of different web servers worldwide. Latency varies wildly (1ms to 10s). The long tail of slow/dead servers dominates — most icons download in <500ms but timeouts (10s) hold workers.
|
||||||
|
- **The long pole of the pipeline** — longest stage at 30M scale.
|
||||||
|
- **5000 concurrent goroutines** to keep throughput high despite variable latency. Not CPU-bound (magic byte checks and SHA-256 are fast). Not DB-bound (one write per icon at ~1ms, self-smoothing due to random server latencies).
|
||||||
|
- **Memory is the concurrency limit** — each goroutine holds a TCP connection + TLS session + icon data buffer. At 5000 workers on c5.2xlarge (16GB), ~2-3GB for connection overhead — comfortable.
|
||||||
|
- **Disk I/O is negligible** — icons are small (median ~5KB), writes are sharded across directories.
|
||||||
|
- **DNS is cached** — Unbound's aggressive caching (1.7GB cache, 3600s min-TTL) means repeat TLD/nameserver lookups are instant. First-seen domains incur recursive resolution (~50-100ms) but this is pipelined with the HTTP request.
|
||||||
|
- **Measured: 439 icons/sec** at concurrency 1000 on c5.xlarge. Expected to improve significantly at 5000 concurrency on c5.2xlarge.
|
||||||
|
|
||||||
|
### Stage 4: Best Icon Selection
|
||||||
|
- **CPU-bound (Postgres).** Single SQL query with `DISTINCT ON` and multi-column sort. Runs in seconds even at 30M — Postgres handles this efficiently with the `idx_icons_host_id` index.
|
||||||
|
|
||||||
|
### Stage 5: Bundle Generation
|
||||||
|
- **CPU-bound (image conversion).** Decoding icons (especially ICO) and re-encoding as PNG is the bottleneck. 40 converter goroutines on c5.2xlarge (8 cores) keep all cores saturated. More goroutines don't help — they just compete for cores.
|
||||||
|
- **Disk I/O is secondary** — reading small icon files from the sharded directory. Usually cached in the OS page cache after first access.
|
||||||
|
- **S3 uploads are pipelined** — 10 upload workers hide the ~50-100ms PUT latency. The assembler serializes bundles while previous uploads are in flight.
|
||||||
|
- **DB reads are pipelined** — the fetcher goroutine prefetches pages while converters work, so workers never wait for DB.
|
||||||
|
- **Measured: 2,377 hosts/sec** at concurrency 20 on c5.xlarge (4 cores). Expected ~4500+ hosts/sec at concurrency 40 on c5.2xlarge.
|
||||||
|
|
||||||
|
### Stage 6: Frontend Deploy
|
||||||
|
- **Network-bound.** 4 small file uploads to S3 + CloudFront invalidation. Seconds.
|
||||||
|
|
||||||
|
### Summary: what would make each stage faster
|
||||||
|
|
||||||
|
| Stage | Current bottleneck | To speed up further |
|
||||||
|
|-------|-------------------|-------------|
|
||||||
|
| CC-Index | Memory (DuckDB hash table spill) | Streaming dedup via INSERT ON CONFLICT, or more RAM |
|
||||||
|
| WARC parsing | CPU + S3 latency | More cores, or multiple EC2 instances |
|
||||||
|
| Icon download | Internet latency (slow/dead servers) | Multiple EC2 instances |
|
||||||
|
| Bundle gen | CPU (image decode/encode) | More cores, or better image libraries |
|
||||||
|
| Deploy | N/A | Already seconds |
|
||||||
|
|
||||||
## DNS Architecture
|
## DNS Architecture
|
||||||
|
|
||||||
**Unbound** runs on the EC2 instance as the system DNS resolver.
|
**Unbound** runs on the EC2 instance as the system DNS resolver.
|
||||||
|
|
@ -479,13 +534,13 @@ This is served publicly at `/stats.json` on the live site — interesting metada
|
||||||
|
|
||||||
| Item | Estimate |
|
| Item | Estimate |
|
||||||
|------|----------|
|
|------|----------|
|
||||||
| EC2 c5.xlarge (~3-4 days) | $12-16 |
|
| EC2 c5.2xlarge (~2-3 days) | $16-24 |
|
||||||
| EBS 1TB gp3 (~4 days) | $10 |
|
| EBS 1TB gp3 (~3 days) | $8 |
|
||||||
| RDS db.t3.medium (~4 days) | $4-6 |
|
| RDS db.m5.large (~3 days) | $12-15 |
|
||||||
| Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) |
|
| Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) |
|
||||||
| Data transfer (icon downloads from internet, inbound) | $0 (inbound free) |
|
| Data transfer (icon downloads from internet, inbound) | $0 (inbound free) |
|
||||||
| Data transfer (backup to homelab, outbound) | $5-45 (depends on icon archive size) |
|
| Data transfer (backup to homelab, outbound) | $5-59 (depends on icon archive size) |
|
||||||
| **Total** | **~$31-77** |
|
| **Total** | **~$41-106** |
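As a cross-check of the total row, the line items above can be summed directly. The sketch below uses only figures from this table, plus the $0.34/hr c5.2xlarge rate quoted in PLAN.md; the type and function names are hypothetical.

```go
package main

import "fmt"

// costRange is a low/high estimate in USD for one line item.
type costRange struct {
	item      string
	low, high float64
}

// total sums the low and high ends of every line item.
func total(items []costRange) (low, high float64) {
	for _, c := range items {
		low += c.low
		high += c.high
	}
	return low, high
}

func main() {
	// EC2 range derived from the hourly rate: 2-3 days = 48-72 hours.
	fmt.Printf("EC2: $%.0f-%.0f\n", 0.34*48, 0.34*72) // prints EC2: $16-24

	items := []costRange{
		{"EC2 c5.2xlarge (~2-3 days)", 16, 24},
		{"EBS 1TB gp3 (~3 days)", 8, 8},
		{"RDS db.m5.large (~3 days)", 12, 15},
		{"Common Crawl S3 reads", 0, 0},
		{"Inbound data transfer", 0, 0},
		{"Backup to homelab (outbound)", 5, 59},
	}
	low, high := total(items)
	fmt.Printf("Total: $%.0f-%.0f\n", low, high) // prints Total: $41-106
}
```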
|
||||||
|
|
||||||
### Hosting Phase (Monthly Steady-State)
|
### Hosting Phase (Monthly Steady-State)
|
||||||
|
|
||||||
|
|
|
||||||
42
PLAN.md
|
|
@ -606,25 +606,43 @@ Adversarial code review followed by performance improvements, validated with a 3
|
||||||
→ 2,358 bundles, 603MB total
|
→ 2,358 bundles, 603MB total
|
||||||
```
|
```
|
||||||
|
|
||||||
## Phase 8: 30M Full Run (Single Machine)
|
## Phase 8: 30M Full Run [IN PROGRESS]
|
||||||
|
|
||||||
Full internet scan on one c5.xlarge.
|
Full internet scan. Upgraded to c5.2xlarge + db.m5.large after 300K run revealed bottlenecks.
|
||||||
|
|
||||||
|
### Infrastructure changes (mid-run)
|
||||||
|
- **EC2: c5.xlarge → c5.2xlarge** (8 vCPUs, 16GB RAM). WARC parsing was CPU-bound at 100% on 4 cores, icon download was memory-limited at 2500 concurrent connections, and bundle gen was CPU-bound. The extra cores and RAM benefit all three stages.
|
||||||
|
- **RDS: db.t3.medium → db.m5.large** (non-burstable, 2 vCPUs, 8GB RAM). The t3 burstable instance was getting CPU-throttled under sustained write load, causing the WARC parser's DB writer to stall and back-pressure workers.
|
||||||
|
- **Swap: 4GB → 8GB** (doubled as a safety margin).
|
||||||
|
- **DuckDB temp_directory** set to EBS (`~/duckdb_temp`) instead of defaulting to tmpfs. DuckDB's managed spill-to-disk is far more efficient than OS swap — sequential large reads vs random 4KB page faults.
|
||||||
|
|
||||||
|
### Tuning changes
|
||||||
|
- **WARC parser concurrency: 500** (unchanged, but now on 8 cores instead of 4, so more workers make progress at once)
|
||||||
|
- **WARC parser write batch: 100 → 500** (~2000 queries per DB round-trip). Fewer flushes = less back-pressure on workers.
|
||||||
|
- **WARC parser startup**: removed slow `COUNT(*) WHERE parsed = FALSE` query (scans 26M-row index, takes minutes). Not needed — fetcher discovers empty results naturally.
|
||||||
|
- **WARC parser channel buffers**: hostCh 5K → 20K, resultCh 500 → 1K. Prevents micro-stalls between DB fetcher queries.
|
||||||
|
- **Icon download concurrency: 1000 → 5000** (16GB RAM supports the connection overhead).
|
||||||
|
- **Icon download channel buffer**: 5K → 20K.
|
||||||
|
- **Bundle gen concurrency: 20 → 40** (8 cores × 5).
|
||||||
|
- **Bundle gen channel buffers**: 1.2K → 6K.
|
||||||
|
- **Debug logging** added to WARC parser fetcher and writer to diagnose stalls.
|
||||||
|
|
||||||
|
### Run parameters
|
||||||
- `--limit 0` for CC-Index query
|
- `--limit 0` for CC-Index query
|
||||||
- **Expected (extrapolated from 300K run):**
|
- **CC-Index result:** 26,703,146 hosts (20.6M https, 6.1M http)
|
||||||
- CC-Index: ~10min (download) + ~15min (query) — possibly much longer at 30M due to swap thrashing
|
|
||||||
- WARC parsing: ~14-15hrs (566 hosts/sec)
|
|
||||||
- Icon download: ~50-60hrs (439 icons/sec at 1000 concurrency, the long pole — 2500 concurrency may improve this)
|
|
||||||
- Bundle gen: ~3.5hrs (2,377 hosts/sec)
|
|
||||||
- **Total: ~3 days**
|
|
||||||
- Run in tmux, monitor with `psql` queries from another session
|
- Run in tmux, monitor with `psql` queries from another session
|
||||||
- **Expected disk:** ~650GB for all icons (6.5GB per 300K × 100)
|
- **Expected disk:** ~650GB for all icons
|
||||||
- **Cost:** ~$50 (EC2 + RDS + 1TB EBS for 3-4 days)
|
- **Expected time:** ~2 days with the upgraded infra (down from the earlier ~3 day estimate)
|
||||||
|
- **Cost:** ~$65 (EC2 c5.2xlarge + RDS m5.large + 1TB EBS for 2-3 days)
|
||||||
- After completion: deploy frontend, verify live site, backup icons + DB to homelab via rsync
|
- After completion: deploy frontend, verify live site, backup icons + DB to homelab via rsync
|
||||||
- **Stuck icon recovery** (if icon download crashes): `UPDATE icons SET scan_state = 'unscanned' WHERE scan_state = 'in_progress';`
|
- **Stuck icon recovery** (if icon download crashes): `UPDATE icons SET scan_state = 'unscanned' WHERE scan_state = 'in_progress';`
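As a rough cross-check of the time estimate, the CC-Index host count can be divided by the throughput figures quoted in ARCHITECTURE.md (measured on c5.xlarge, expected on c5.2xlarge). This is a back-of-the-envelope sketch that assumes a steady rate for the whole run; `etaAt` is a hypothetical helper. Icon download is left out because its rate is per icon and the total icon count is not stated here.

```go
package main

import (
	"fmt"
	"time"
)

// etaAt returns how long n items take at a steady rate (items per second).
func etaAt(n int, perSec float64) time.Duration {
	return time.Duration(float64(n) / perSec * float64(time.Second))
}

func main() {
	const hosts = 26_703_146 // CC-Index result from this run

	fmt.Printf("WARC parsing @ 566/s:  %.1f h\n", etaAt(hosts, 566).Hours())
	fmt.Printf("WARC parsing @ 1000/s: %.1f h\n", etaAt(hosts, 1000).Hours())
	fmt.Printf("Bundle gen   @ 2377/s: %.1f h\n", etaAt(hosts, 2377).Hours())
	fmt.Printf("Bundle gen   @ 4500/s: %.1f h\n", etaAt(hosts, 4500).Hours())
}
```

At the measured rates this gives roughly 13 hours for WARC parsing and 3 hours for bundle gen; at the expected c5.2xlarge rates, roughly 7.4 hours and 1.6 hours.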
|
||||||
|
|
||||||
### Consider c5.2xlarge for future runs
|
### Lessons learned (during 30M run)
|
||||||
The CC-Index DuckDB query is memory-bound — at 30M the GROUP BY hash table exceeds 8GB and swap thrashing dominates query time. c5.2xlarge (16GB, 8 vCPUs) would eliminate swap pressure entirely and double CPU cores for bundle gen. Cost difference: $0.17/hr → $0.34/hr, but if it halves the CC-Index query time and speeds up bundle gen (CPU-bound), the total EC2 hours may decrease enough to break even. Also benefits WARC parsing (more headroom for 500+ goroutines) and icon download (more memory for 5000 concurrent connections). Worth testing on a future run.
|
- **Burstable DB instances are unsuitable for pipeline workloads.** The t3.medium throttled under sustained writes, causing stalls that propagated through the entire WARC parser pipeline via channel back-pressure. Non-burstable m5 instances provide consistent performance.
|
||||||
|
- **WARC parsing is CPU-bound, not just I/O-bound.** At 500 goroutines on 4 cores, CPU was at 100% — TLS handshakes + gzip decompression + HTML parsing add up. More cores directly increases throughput.
|
||||||
|
- **Channel buffer sizing matters.** Small buffers (5K) caused micro-stalls every time the DB fetcher ran a query. 20K buffers give the fetcher enough runway to query without starving workers.
|
||||||
|
- **DuckDB temp_directory is critical at scale.** Without it, DuckDB spills to tmpfs (RAM-backed), which then swaps to disk via the OS — double indirection. Pointing temp_directory at EBS lets DuckDB manage spill efficiently with large sequential I/O.
|
||||||
|
- **COUNT(*) on large partial indexes is expensive.** The startup query `SELECT COUNT(*) FROM hosts WHERE parsed = FALSE` on 26M rows took minutes. Unnecessary — just start processing and discover completion naturally.
|
||||||
|
|
||||||
## Phase 9: Frontend Polish
|
## Phase 9: Frontend Polish
|
||||||
|
|
||||||
|
|
|
||||||