diff --git a/PLAN.md b/PLAN.md
index 7a57658..6b29002 100644
--- a/PLAN.md
+++ b/PLAN.md
@@ -179,104 +179,37 @@ Binary: `pipeline/03_icon_download/` (6 files: main.go, download.go, image.go, s
 
 ---
 
-## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5)
+## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5) [COMPLETED]
 
-### Step 4.1: Best Icon Selection SQL
+### Step 4.1: Best Icon Selection SQL [COMPLETED]
 
-Write `pipeline/04_best_icon/select.sql`:
+Script: `pipeline/04_best_icon/select.sql`
 
-```sql
-UPDATE hosts h SET best_icon_s3_key = sub.s3_key
-FROM (
-    SELECT DISTINCT ON (i.host_id) i.host_id, i.s3_key
-    FROM icons i
-    WHERE i.scan_state = 'completed'
-    ORDER BY i.host_id,
-        CASE
-            WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
-            WHEN i.width = i.height AND i.width <= 64 THEN 1
-            WHEN i.width IS NOT NULL AND i.width <= 64 AND i.height <= 64 THEN 2
-            ELSE 3
-        END,
-        COALESCE(i.width, 0) DESC,
-        CASE
-            WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
-            WHEN i.content_type = 'image/webp' THEN 1
-            WHEN i.content_type = 'image/svg+xml' THEN 2
-            ELSE 3
-        END,
-        i.file_size ASC
-) sub
-WHERE h.id = sub.host_id;
-```
+Selects the best icon per host using `DISTINCT ON` with priority ordering. Excludes SVGs (can't rasterize) and ≤2x2 icons (tracking pixels). See ARCHITECTURE.md for the full decision flow.
 
-**Validation:**
-- `SELECT COUNT(*) FROM hosts WHERE best_icon_s3_key IS NOT NULL;` — expect 60-80% of hosts
-- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL AND best_icon_s3_key IS NULL;` — hosts with title but no icon (will still be in bundles)
-- Spot-check: for a few hosts, verify the selected icon is reasonable (correct size, valid)
+**Result:** 70,366 hosts got an icon (72%), 23,018 have title but no icon.
 
-**Stats:** `stats/04_best_icon.json`
+### Steps 4.2-4.4: Bundle Generator [COMPLETED]
 
-**Done when:** best_icon_s3_key populated for hosts that have valid icons.
+Binary: `pipeline/05_bundle_gen/` (6 files: main.go, bundle.go, convert.go, db.go, s3.go, log.go)
 
-### Step 4.2: Bundle Generator Go Program
+**Architecture:**
+- Queries all hosts with titles (randomized), concurrently downloads best icon from S3 icons bucket
+- Uses `github.com/biessek/golang-ico` for ICO decoding (handles all bit depths including palette-based 1/4/8bpp)
+- `image.Decode` handles PNG/GIF/JPEG/WebP/BMP/ICO via registered decoders. SVGs excluded.
+- Icons >128px downscaled to 32x32 (nearest-neighbor). Icons ≤128px kept as-is.
+- Re-encodes all icons as PNG, base64-encoded inline in bundle JSON.
+- Panic recovery per icon conversion (malformed ICO files in the library)
+- Concurrent S3 downloads with configurable concurrency (default 50)
 
-```
-pipeline/05_bundle_gen/
-├── main.go       # Entry point, CLI flags
-├── db.go         # Query hosts + icon keys
-├── convert.go    # Icon format conversion → PNG
-├── bundle.go     # Chunk + serialize JSON
-└── s3.go         # Upload bundles to everytab-site
-```
+**CLI:** `./bundle_gen --db URL [--icons-bucket NAME] [--site-bucket NAME] [--entries-per-bundle N] [--concurrency N] [--limit N] [--dry-run] [--output-dir DIR] [--log-file PATH] [--log-errors-only]`
 
-CLI flags:
-- `--db` connection string
-- `--icons-bucket` (default `everytab-icons`)
-- `--site-bucket` (default `everytab-site`)
-- `--entries-per-bundle` (tunable, start at 120)
-- `--dry-run` (generate bundles to local disk, don't upload)
-- `--limit` (only process N hosts, for testing)
-
-### Step 4.3: Icon Conversion Logic
-
-Implement format conversion to PNG:
-1. Download icon from S3 by key
-2. Detect format from magic bytes
-3. Decode:
-   - PNG: decode directly
-   - ICO: parse container, extract image at recorded width/height, decode BMP or PNG within
-   - GIF/JPEG/BMP/WebP: decode to RGBA
-   - SVG: rasterize to 32x32 (use a Go SVG library, or shell out to `rsvg-convert` if simpler)
-4. Re-encode as PNG (optimized, don't upscale)
-5. Base64-encode
-
-**Test:** Convert 50 icons of mixed formats manually, verify output PNGs look correct.
-
-### Step 4.4: Bundle Assembly + Upload
-
-Implement:
-1. Query all hosts WHERE html_title IS NOT NULL, randomize (ORDER BY random())
-2. For each host: fetch + convert its icon (or set empty string if no icon)
-3. Assemble entries into chunks of `ENTRIES_PER_BUNDLE`
-4. Serialize each chunk as JSON (`tabs/{n}.json`)
-5. Upload to S3 `everytab-site/tabs/`
-6. Record total bundle count
-
-**Dry-run:** Generate bundles to local disk, inspect a few:
-- Valid JSON
-- Icons render in browser (paste a data:image/png;base64,... URI)
-- Entries have host, title, icon, icon_w, icon_h, iframe_ok
-
-**Validation:**
-- Bundle files exist in S3
-- `aws s3 ls s3://everytab-site/tabs/ | wc -l` matches expected count
-- Random bundle can be fetched and parsed as JSON
-- Total hosts across all bundles = count of hosts with titles
-
-**Stats:** `stats/05_bundle_gen.json`
-
-**Done when:** All bundles uploaded to S3, JSON is valid, icons render.
+**Result (93K hosts with titles, 70K with icons):**
+- Duration: 1m30s
+- Bundles created: 779 (120 entries each, last bundle partial)
+- Total size: 165MB (avg 216KB per bundle)
+- Convert errors: 1,263 (1,077 SVGs + 186 other — panics, truncated files, corrupt GIFs)
+- S3: 779 JSON files in `everytab-site/tabs/`
 
 ---
 
@@ -620,6 +553,23 @@ On completion, each program prints a summary line and writes its stats JSON (wit
 - Favicon download is I/O bound (network latency to diverse hosts worldwide). Concurrency helps up to a point, then the long tail of slow/dead servers dominates. 351 icons/sec at 200 concurrency.
 - Invalid image detection (magic bytes) catches ~5% of "successful" downloads that are actually HTML error pages served at `/favicon.ico`.
 
+### Phase 4 — Completed 2026-05-18
+
+**Changes from original plan:**
+- Used `github.com/biessek/golang-ico` instead of hand-rolled ICO decoder. Handles all bit depths (1/4/8/24/32bpp) correctly. Eliminated ~20 ICO decode errors from the hand-rolled version.
+- SVGs excluded from best-icon selection (can't rasterize without external deps). SVG-only hosts show up with no icon instead of failing at conversion time.
+- Added ≤2x2 pixel exclusion from best-icon selection (tracking pixels / garbage favicons).
+- Icons >128px downscaled to 32x32 during bundle generation. Icons ≤128px (including 80x80) kept as-is — browser CSS handles display scaling.
+- Added panic recovery around icon conversion (the ICO library panics on some malformed files).
+- Added concurrency for S3 icon downloads during bundle generation (was single-threaded, now 50 concurrent).
+
+**Lessons learned:**
+- Many hosts (28%) have no usable favicon at all — their /favicon.ico returns HTML or 404, and they have no link rel="icon". These appear in bundles title-only.
+- The golang-ico library panics on certain malformed ICO files (index out of bounds). Third-party decoders need panic recovery wrappers.
+- 80x80 icons are overwhelmingly one single default favicon shared by a hosting platform (~4,276 sites share one hash). Content-addressed storage handles this.
+- Bundle sizes are very heterogeneous (39KB to 198KB) due to icon size variance. Average 216KB is well within our target.
+- SVG favicons are ~3.5% of downloaded icons (5,128 out of 156K). Supporting SVG rasterization would recover ~1,077 hosts. Deferred to future improvement.
+
 ---
 
 ## Future Improvements
@@ -631,3 +581,5 @@ On completion, each program prints a summary line and writes its stats JSON (wit
 - **Encoding: investigate remaining garbled titles** — Some titles still show `�` in output (e.g., `BERGSTRANDS BAGERI �...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
 - **Icon download: retry transient failures** — DNS and timeout failures could benefit from a single retry. Would recover a small percentage of icons.
 - **Icon download: download large link_rel icons** — Currently skipping declared sizes >64x64. Re-run with broader filter for future high-res projects.
+- **Bundle gen: SVG rasterization** — ~1,077 hosts have SVG-only favicons. Could add `rsvg-convert` or a Go SVG library to rasterize these.
+- **Bundle gen: smarter downscaling** — Currently nearest-neighbor to 32x32 for >128px icons. Could use bilinear/Lanczos for better quality, or preserve aspect ratio for non-square icons.
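
The bundle generator rules recorded in this change (icons over 128px downscaled to 32x32 by nearest-neighbor, smaller icons kept as-is, panic recovery around each conversion) could look roughly like the Go sketch below. This is an illustration only, not the contents of `convert.go`: the names `needsDownscale`, `downscaleNearest`, `prepareIcon` and `safeConvert` are hypothetical, and treating an icon as oversized when either dimension exceeds 128px is an assumption.

```go
package main

import (
	"fmt"
	"image"
)

const (
	maxKeepPx = 128 // icons at or below this edge length are kept as-is
	thumbPx   = 32  // target edge length for oversized icons
)

// needsDownscale reports whether an icon is oversized.
// Assumption: "either dimension > 128px" counts as oversized.
func needsDownscale(w, h int) bool {
	return w > maxKeepPx || h > maxKeepPx
}

// downscaleNearest resizes src to thumbPx x thumbPx by nearest-neighbor
// sampling: each destination pixel copies one source pixel, no blending.
func downscaleNearest(src image.Image) *image.RGBA {
	sb := src.Bounds()
	dst := image.NewRGBA(image.Rect(0, 0, thumbPx, thumbPx))
	for y := 0; y < thumbPx; y++ {
		sy := sb.Min.Y + y*sb.Dy()/thumbPx
		for x := 0; x < thumbPx; x++ {
			sx := sb.Min.X + x*sb.Dx()/thumbPx
			dst.Set(x, y, src.At(sx, sy))
		}
	}
	return dst
}

// prepareIcon applies the size rule: downscale oversized icons, keep the rest.
func prepareIcon(src image.Image) image.Image {
	b := src.Bounds()
	if needsDownscale(b.Dx(), b.Dy()) {
		return downscaleNearest(src)
	}
	return src
}

// safeConvert runs one icon conversion and turns a panic (for example from a
// third-party decoder hitting a malformed file) into an ordinary error, so a
// single bad icon cannot abort the whole run.
func safeConvert(convert func() ([]byte, error)) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			out, err = nil, fmt.Errorf("icon conversion panicked: %v", r)
		}
	}()
	return convert()
}

func main() {
	big := image.NewRGBA(image.Rect(0, 0, 256, 256))
	small := image.NewRGBA(image.Rect(0, 0, 80, 80))
	fmt.Println(prepareIcon(big).Bounds().Dx(), prepareIcon(small).Bounds().Dx())

	_, err := safeConvert(func() ([]byte, error) {
		panic("simulated decoder panic")
	})
	fmt.Println("recovered:", err)
}
```

Nearest-neighbor keeps the loop dependency-free, at the cost of aliasing on photographic icons. A bilinear or Lanczos scaler, as the last Future Improvements item suggests, would replace only `downscaleNearest`.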