updated s3_key name to icon_hash

Joe Lothan 2026-05-25 21:05:26 -04:00
parent e308718eb2
commit 33bd0a221e
8 changed files with 31 additions and 31 deletions


@@ -96,7 +96,7 @@ Icons are stored on local disk during scanning, not S3. The EBS volume holds the
| warc_record_length | INT NOT NULL | Length of WARC record |
| html_title | TEXT | Extracted from `<title>` tag |
| iframe_allowed | BOOLEAN | True if site allows framing |
-| best_icon_s3_key | TEXT | SHA-256 hash of the chosen icon file (denormalized for fast bundle gen) |
+| best_icon_hash | TEXT | SHA-256 hash of the chosen icon file (denormalized for fast bundle gen) |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |
| random_order | DOUBLE PRECISION DEFAULT random() | Random value for shuffled bundle generation pagination |
@@ -114,7 +114,7 @@ Icons are stored on local disk during scanning, not S3. The EBS volume holds the
| width | INT | Best usable pixel width (for ICO: largest standard size ≤64; for SVG: NULL) |
| height | INT | Best usable pixel height (for ICO: largest standard size ≤64; for SVG: NULL) |
| file_size | INT | Size in bytes |
-| s3_key | TEXT | SHA-256 hash of content (used as local file path, legacy column name) |
+| icon_hash | TEXT | SHA-256 hash of content (used as local file path: `ab/cd/ef/{hash}`) |
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
| error | TEXT | Error message if failed |
| downloaded_at | TIMESTAMPTZ | When the icon was fetched (NULL if not yet downloaded) |
@@ -251,7 +251,7 @@ WHERE url_path = '/'
- SVG: store width=NULL, height=NULL (vector, no pixel size)
- Compute SHA-256 of content
- Write to local disk at `{icons_dir}/ab/cd/ef/{sha256}` (skip if file already exists — dedup)
-- Update icons row: s3_key (the SHA-256 hash), content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
+- Update icons row: icon_hash (the SHA-256 hash), content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
- On failure: scan_state = 'failed', error = reason
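
The hash-and-store steps above can be sketched in Go roughly as follows. This is a minimal illustration, not the project's actual code: the function names `hashHex`, `shardedPath`, and `storeIcon` are hypothetical, and it assumes that `ab/cd/ef` in the path template stands for the first three hex pairs of the SHA-256 hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// hashHex returns the lowercase hex SHA-256 digest of the icon bytes.
// This is the value the doc stores in icons.icon_hash.
func hashHex(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// shardedPath maps a hash to {iconsDir}/ab/cd/ef/{hash}.
// Assumption: ab, cd, ef are the first three hex pairs of the hash.
// The hash must be a full hex digest (at least 6 characters).
func shardedPath(iconsDir, hash string) string {
	return filepath.Join(iconsDir, hash[0:2], hash[2:4], hash[4:6], hash)
}

// storeIcon writes content under its hash and reports whether a new file
// was created. An existing file means identical content is already stored.
func storeIcon(iconsDir string, content []byte) (hash string, wrote bool, err error) {
	hash = hashHex(content)
	path := shardedPath(iconsDir, hash)
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return "", false, err
	}
	// O_EXCL makes "create only if absent" atomic, so two workers that
	// download the same icon cannot both write it.
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if errors.Is(err, fs.ErrExist) {
		return hash, false, nil // dedup: skip the write
	}
	if err != nil {
		return "", false, err
	}
	if _, err := f.Write(content); err != nil {
		f.Close()
		return "", false, err
	}
	return hash, true, f.Close()
}

func main() {
	dir, err := os.MkdirTemp("", "icons")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	hash, wrote, err := storeIcon(dir, []byte("fake icon bytes"))
	fmt.Println(hash, wrote, err)

	// Storing the same bytes again finds the existing file and skips the write.
	_, wrote, err = storeIcon(dir, []byte("fake icon bytes"))
	fmt.Println(wrote, err)
}
```

One caveat with this sketch: a crash in the middle of a write would leave a truncated file that later runs treat as already stored. Writing to a temporary file and renaming it into place is a common way around that, at the cost of a little extra code.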
**Concurrency:** Channel-based worker pool (default 2500 workers, configurable). Producer goroutine feeds a buffered channel (buffer = batch size), shuffles each batch to avoid hitting the same host back-to-back. N workers consume from the channel.
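
As a rough illustration of the worker pool described above, the sketch below wires a producer goroutine, a buffered channel, and N workers together. The name `runPool`, its parameters, and the `fetch` callback are hypothetical stand-ins; the real downloader presumably also handles timeouts, retries, and database updates.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
)

// runPool pushes every URL in batches through a buffered channel to a fixed
// number of workers and returns how many fetches succeeded.
func runPool(batches [][]string, workers, batchSize int, fetch func(url string) error) int64 {
	jobs := make(chan string, batchSize) // buffer = batch size, as in the doc
	var succeeded int64
	var wg sync.WaitGroup

	// N workers consume from the channel until it is closed and drained.
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range jobs {
				if fetch(url) == nil {
					atomic.AddInt64(&succeeded, 1)
				}
			}
		}()
	}

	// Producer goroutine: shuffle each batch so URLs for the same host are
	// less likely to be dispatched back-to-back, then feed the channel.
	go func() {
		defer close(jobs)
		for _, batch := range batches {
			shuffled := append([]string(nil), batch...) // leave the caller's slice untouched
			rand.Shuffle(len(shuffled), func(i, j int) {
				shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
			})
			for _, url := range shuffled {
				jobs <- url
			}
		}
	}()

	wg.Wait()
	return atomic.LoadInt64(&succeeded)
}

func main() {
	batches := [][]string{
		{"https://a.example/favicon.ico", "https://a.example/icon.png"},
		{"https://b.example/favicon.ico"},
	}
	n := runPool(batches, 4, 2, func(url string) error {
		fmt.Println("fetching", url) // stand-in for the real HTTP download
		return nil
	})
	fmt.Println("succeeded:", n)
}
```

The buffered channel gives natural backpressure: the producer blocks once the buffer is full, so memory stays bounded by the buffer size plus one in-flight item per worker.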
@@ -366,7 +366,7 @@ Each pipeline stage has different bottlenecks. Understanding these explains the
- **Memory is the concurrency limit** — each goroutine holds a TCP connection + TLS session + icon data buffer. At 5000 workers on c5.2xlarge (16GB), ~2-3GB for connection overhead — comfortable.
- **Disk I/O is negligible** — icons are small (median ~5KB), writes are sharded across directories.
- **DNS is cached** — Unbound's aggressive caching (1.7GB cache, 3600s min-TTL) means repeat TLD/nameserver lookups are instant. First-seen domains incur recursive resolution (~50-100ms) but this is pipelined with the HTTP request.
-- **Measured: 439 icons/sec** at concurrency 1000 on c5.xlarge. Expected to improve significantly at 5000 concurrency on c5.2xlarge.
+- **Measured: 2,136 icons/sec** at concurrency 5000 on c5.2xlarge (up from 439/sec at 1000 concurrency on c5.xlarge). CPU-bound at 90%.
### Stage 4: Best Icon Selection
- **CPU-bound (Postgres).** Single SQL query with `DISTINCT ON` and multi-column sort. Runs in seconds even at 30M — Postgres handles this efficiently with the `idx_icons_host_id` index.
@@ -591,4 +591,4 @@ If the site gets significant traffic beyond CloudFront free tier, costs scale wi
9. **Per-millisecond random seed** — Every visitor sees a unique arrangement. No shared state, no server needed for randomization.
10. **Viewport-sized bundles** — ~100-150 tabs per bundle, tuned to fill a screen. Faster loads, smaller memory footprint than 1MB bundles.
11. **Include no-icon hosts** — Any host with a title is included. Firefox-style rendering (title only) for hosts without favicons.
-12. **Denormalized best_icon_s3_key in hosts** — Stores the SHA-256 hash of the chosen icon. Avoids joins during bundle generation. Written once during icon selection, read once during bundling.
+12. **Denormalized best_icon_hash in hosts** — Stores the SHA-256 hash of the chosen icon. Avoids joins during bundle generation. Written once during icon selection, read once during bundling.