updated PLAN.md for phase 4

Joe Lothan 2026-05-17 23:06:11 -04:00
parent f89883e745
commit 771f5d76ab

130
PLAN.md

@@ -179,104 +179,37 @@ Binary: `pipeline/03_icon_download/` (6 files: main.go, download.go, image.go, s
---
## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5)
## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5) [COMPLETED]
### Step 4.1: Best Icon Selection SQL
### Step 4.1: Best Icon Selection SQL [COMPLETED]
Write `pipeline/04_best_icon/select.sql`:
Script: `pipeline/04_best_icon/select.sql`
```sql
UPDATE hosts h SET best_icon_s3_key = sub.s3_key
FROM (
SELECT DISTINCT ON (i.host_id) i.host_id, i.s3_key
FROM icons i
WHERE i.scan_state = 'completed'
ORDER BY i.host_id,
CASE
WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
WHEN i.width = i.height AND i.width <= 64 THEN 1
WHEN i.width IS NOT NULL AND i.width <= 64 AND i.height <= 64 THEN 2
ELSE 3
END,
COALESCE(i.width, 0) DESC,
CASE
WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
WHEN i.content_type = 'image/webp' THEN 1
WHEN i.content_type = 'image/svg+xml' THEN 2
ELSE 3
END,
i.file_size ASC
) sub
WHERE h.id = sub.host_id;
```
Selects the best icon per host using `DISTINCT ON` with priority ordering. Excludes SVGs (can't rasterize) and ≤2x2 icons (tracking pixels). See ARCHITECTURE.md for the full decision flow.
**Validation:**
- `SELECT COUNT(*) FROM hosts WHERE best_icon_s3_key IS NOT NULL;` — expect 60-80% of hosts
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL AND best_icon_s3_key IS NULL;` — hosts with title but no icon (will still be in bundles)
- Spot-check: for a few hosts, verify the selected icon is reasonable (correct size, valid)
**Result:** 70,366 hosts got an icon (72%), 23,018 have title but no icon.
**Stats:** `stats/04_best_icon.json`
### Steps 4.2-4.4: Bundle Generator [COMPLETED]
**Done when:** best_icon_s3_key populated for hosts that have valid icons.
Binary: `pipeline/05_bundle_gen/` (6 files: main.go, bundle.go, convert.go, db.go, s3.go, log.go)
### Step 4.2: Bundle Generator Go Program
**Architecture:**
- Queries all hosts with titles (randomized), concurrently downloads best icon from S3 icons bucket
- Uses `github.com/biessek/golang-ico` for ICO decoding (handles all bit depths including palette-based 1/4/8bpp)
- `image.Decode` handles PNG/GIF/JPEG/WebP/BMP/ICO via registered decoders. SVGs excluded.
- Icons >128px downscaled to 32x32 (nearest-neighbor). Icons ≤128px kept as-is.
- Re-encodes all icons as PNG, base64-encoded inline in bundle JSON.
- Panic recovery per icon conversion (the ICO library panics on some malformed files)
- Concurrent S3 downloads with configurable concurrency (default 50)
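The size rule and re-encoding step above can be sketched roughly as follows. This is a minimal illustration, not the code in `convert.go`: the names `needsDownscale`, `downscaleNearest`, `encodeIcon`, and `solid` are hypothetical, and treating "either dimension >128px" as the trigger is an assumption.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"image"
	"image/color"
	"image/png"
)

const (
	maxKeepPx = 128 // icons at or below this size are kept as-is
	targetPx  = 32  // edge length for downscaled icons
)

// needsDownscale reports whether either dimension exceeds the threshold
// (assumption: "either", the plan only says ">128px").
func needsDownscale(w, h int) bool { return w > maxKeepPx || h > maxKeepPx }

// downscaleNearest resamples src to targetPx x targetPx by copying the
// nearest source pixel for each destination pixel.
func downscaleNearest(src image.Image) *image.RGBA {
	b := src.Bounds()
	dst := image.NewRGBA(image.Rect(0, 0, targetPx, targetPx))
	for y := 0; y < targetPx; y++ {
		sy := b.Min.Y + y*b.Dy()/targetPx
		for x := 0; x < targetPx; x++ {
			sx := b.Min.X + x*b.Dx()/targetPx
			dst.Set(x, y, src.At(sx, sy))
		}
	}
	return dst
}

// encodeIcon applies the size rule, re-encodes as PNG, and returns the
// base64 text plus the final width and height.
func encodeIcon(src image.Image) (string, int, int, error) {
	img := src
	if b := src.Bounds(); needsDownscale(b.Dx(), b.Dy()) {
		img = downscaleNearest(src)
	}
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		return "", 0, 0, err
	}
	b := img.Bounds()
	return base64.StdEncoding.EncodeToString(buf.Bytes()), b.Dx(), b.Dy(), nil
}

// solid builds a w x h single-colour image to stand in for a decoded icon.
func solid(w, h int) image.Image {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			img.Set(x, y, color.RGBA{200, 30, 30, 255})
		}
	}
	return img
}

func main() {
	for _, size := range []int{16, 80, 256} {
		_, w, h, err := encodeIcon(solid(size, size))
		if err != nil {
			fmt.Println("encode failed:", err)
			continue
		}
		fmt.Printf("%dx%d -> %dx%d\n", size, size, w, h)
	}
}
```

Only the 256px input changes size here; the 16px and 80px inputs pass through at their original dimensions, matching the "kept as-is" rule.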
```
pipeline/05_bundle_gen/
├── main.go # Entry point, CLI flags
├── db.go # Query hosts + icon keys
├── convert.go # Icon format conversion → PNG
├── bundle.go # Chunk + serialize JSON
└── s3.go # Upload bundles to everytab-site
```
**CLI:** `./bundle_gen --db URL [--icons-bucket NAME] [--site-bucket NAME] [--entries-per-bundle N] [--concurrency N] [--limit N] [--dry-run] [--output-dir DIR] [--log-file PATH] [--log-errors-only]`
CLI flags:
- `--db` connection string
- `--icons-bucket` (default `everytab-icons`)
- `--site-bucket` (default `everytab-site`)
- `--entries-per-bundle` (tunable, start at 120)
- `--dry-run` (generate bundles to local disk, don't upload)
- `--limit` (only process N hosts, for testing)
### Step 4.3: Icon Conversion Logic
Implement format conversion to PNG:
1. Download icon from S3 by key
2. Detect format from magic bytes
3. Decode:
- PNG: decode directly
- ICO: parse container, extract image at recorded width/height, decode BMP or PNG within
- GIF/JPEG/BMP/WebP: decode to RGBA
- SVG: rasterize to 32x32 (use a Go SVG library, or shell out to `rsvg-convert` if simpler)
4. Re-encode as PNG (optimized, don't upscale)
5. Base64-encode
**Test:** Convert 50 icons of mixed formats manually, verify output PNGs look correct.
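As an illustration of step 2, format detection from magic bytes might look like the sketch below. The function name `detectFormat` and the returned labels are hypothetical; SVG is text-based and has no fixed signature, so this sketch reports it (and HTML error pages) as `unknown`.

```go
package main

import (
	"bytes"
	"fmt"
)

// detectFormat guesses an image format from the leading bytes of a file.
// It checks well-known file signatures only; it does not validate the rest.
func detectFormat(data []byte) string {
	switch {
	case bytes.HasPrefix(data, []byte("\x89PNG\r\n\x1a\n")):
		return "png"
	case bytes.HasPrefix(data, []byte("GIF87a")), bytes.HasPrefix(data, []byte("GIF89a")):
		return "gif"
	case bytes.HasPrefix(data, []byte("\xff\xd8\xff")):
		return "jpeg"
	case len(data) >= 12 && string(data[0:4]) == "RIFF" && string(data[8:12]) == "WEBP":
		return "webp"
	case bytes.HasPrefix(data, []byte("BM")):
		return "bmp"
	case bytes.HasPrefix(data, []byte{0x00, 0x00, 0x01, 0x00}):
		return "ico"
	default:
		return "unknown"
	}
}

func main() {
	samples := map[string][]byte{
		"png header":  []byte("\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"),
		"ico header":  {0x00, 0x00, 0x01, 0x00, 0x01, 0x00},
		"html error":  []byte("<!DOCTYPE html><html>"),
		"webp header": []byte("RIFF\x24\x00\x00\x00WEBPVP8 "),
	}
	for name, data := range samples {
		fmt.Printf("%s: %s\n", name, detectFormat(data))
	}
}
```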
### Step 4.4: Bundle Assembly + Upload
Implement:
1. Query all hosts WHERE html_title IS NOT NULL, randomize (ORDER BY random())
2. For each host: fetch + convert its icon (or set empty string if no icon)
3. Assemble entries into chunks of `ENTRIES_PER_BUNDLE`
4. Serialize each chunk as JSON (`tabs/{n}.json`)
5. Upload to S3 `everytab-site/tabs/`
6. Record total bundle count
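Steps 3 and 4 amount to slicing the entry list into fixed-size chunks and serializing each one. A minimal sketch, assuming the JSON field names listed in the dry-run checks; the Go types, the `chunk` and `bundleKey` helpers, and zero-based numbering of `tabs/{n}.json` are assumptions, not taken from `bundle.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entry is one host in a bundle. Field types are assumed; only the JSON
// key names come from the plan.
type Entry struct {
	Host     string `json:"host"`
	Title    string `json:"title"`
	Icon     string `json:"icon"` // base64 PNG, or "" when the host has no icon
	IconW    int    `json:"icon_w"`
	IconH    int    `json:"icon_h"`
	IframeOK bool   `json:"iframe_ok"`
}

// chunk splits entries into consecutive slices of at most size elements.
// size must be positive. The last chunk may be shorter than the rest.
func chunk(entries []Entry, size int) [][]Entry {
	var bundles [][]Entry
	for start := 0; start < len(entries); start += size {
		end := start + size
		if end > len(entries) {
			end = len(entries)
		}
		bundles = append(bundles, entries[start:end])
	}
	return bundles
}

// bundleKey returns the object key for bundle n (zero-based here, which
// is an assumption).
func bundleKey(n int) string { return fmt.Sprintf("tabs/%d.json", n) }

func main() {
	// 250 placeholder hosts with no icons, purely for illustration.
	entries := make([]Entry, 250)
	for i := range entries {
		entries[i] = Entry{Host: fmt.Sprintf("host%d.example", i), Title: "Example"}
	}
	for n, bundle := range chunk(entries, 120) {
		body, err := json.Marshal(bundle)
		if err != nil {
			fmt.Println("marshal failed:", err)
			return
		}
		fmt.Printf("%s: %d entries, %d bytes\n", bundleKey(n), len(bundle), len(body))
	}
}
```

With 250 entries and 120 per bundle this yields three bundles of 120, 120, and 10 entries; only the last bundle is partial.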
**Dry-run:** Generate bundles to local disk, inspect a few:
- Valid JSON
- Icons render in browser (paste a data:image/png;base64,... URI)
- Entries have host, title, icon, icon_w, icon_h, iframe_ok
**Validation:**
- Bundle files exist in S3
- `aws s3 ls s3://everytab-site/tabs/ | wc -l` matches expected count
- Random bundle can be fetched and parsed as JSON
- Total hosts across all bundles = count of hosts with titles
**Stats:** `stats/05_bundle_gen.json`
**Done when:** All bundles uploaded to S3, JSON is valid, icons render.
**Result (93K hosts with titles, 70K with icons):**
- Duration: 1m30s
- Bundles created: 779 (120 entries each, last bundle partial)
- Total size: 165MB (avg 216KB per bundle)
- Convert errors: 1,263 (1,077 SVGs + 186 other — panics, truncated files, corrupt GIFs)
- S3: 779 JSON files in `everytab-site/tabs/`
---
@@ -620,6 +553,23 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- Favicon download is I/O bound (network latency to diverse hosts worldwide). Concurrency helps up to a point, then the long tail of slow/dead servers dominates. 351 icons/sec at 200 concurrency.
- Invalid image detection (magic bytes) catches ~5% of "successful" downloads that are actually HTML error pages served at `/favicon.ico`.
### Phase 4 — Completed 2026-05-18
**Changes from original plan:**
- Used `github.com/biessek/golang-ico` instead of hand-rolled ICO decoder. Handles all bit depths (1/4/8/24/32bpp) correctly. Eliminated ~20 ICO decode errors from the hand-rolled version.
- SVGs excluded from best-icon selection (can't rasterize without external deps). SVG-only hosts show up with no icon instead of failing at conversion time.
- Added ≤2x2 pixel exclusion from best-icon selection (tracking pixels / garbage favicons).
- Icons >128px downscaled to 32x32 during bundle generation. Icons ≤128px (including 80x80) kept as-is — browser CSS handles display scaling.
- Added panic recovery around icon conversion (the ICO library panics on some malformed files).
- Added concurrency for S3 icon downloads during bundle generation (was single-threaded, now 50 concurrent).
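The move from single-threaded to bounded concurrent downloads can be sketched with a buffered channel used as a semaphore. This is one common Go pattern, not necessarily the one the bundle generator uses; `fetchAll` and `peakInFlight` are hypothetical names, and the fetch function here is a stand-in for an S3 download.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// fetchAll calls fetch once per key with at most limit calls in flight.
// limit must be at least 1.
func fetchAll(keys []string, limit int, fetch func(key string) []byte) map[string][]byte {
	results := make(map[string][]byte, len(keys))
	var mu sync.Mutex
	var wg sync.WaitGroup
	sem := make(chan struct{}, limit)
	for _, key := range keys {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit fetches are already running
		go func(k string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this fetch ends
			data := fetch(k)
			mu.Lock()
			results[k] = data
			mu.Unlock()
		}(key)
	}
	wg.Wait()
	return results
}

// peakInFlight runs n fake fetches and reports the highest number that
// were observed running at the same time.
func peakInFlight(n, limit int) int64 {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = fmt.Sprintf("icons/%d", i)
	}
	var cur, peak int64
	fetchAll(keys, limit, func(string) []byte {
		c := atomic.AddInt64(&cur, 1)
		for {
			p := atomic.LoadInt64(&peak)
			if c <= p || atomic.CompareAndSwapInt64(&peak, p, c) {
				break
			}
		}
		time.Sleep(time.Millisecond) // simulate network latency
		atomic.AddInt64(&cur, -1)
		return nil
	})
	return peak
}

func main() {
	fmt.Println("peak concurrency with limit 50:", peakInFlight(500, 50))
}
```

The semaphore caps in-flight work regardless of how many keys are queued, which is the behaviour a `--concurrency` flag would control.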
**Lessons learned:**
- Many hosts (28%) have no usable favicon at all: their `/favicon.ico` returns HTML or a 404, and they have no `<link rel="icon">` tag. These appear in bundles title-only.
- The golang-ico library panics on certain malformed ICO files (index out of bounds). Third-party decoders need panic recovery wrappers.
- 80x80 icons are overwhelmingly one single default favicon shared by a hosting platform (~4,276 sites share one hash). Content-addressed storage handles this.
- Bundle sizes are very heterogeneous (39KB to 198KB) due to icon size variance. Average 216KB is well within our target.
- SVG favicons are ~3.3% of downloaded icons (5,128 out of 156K). Supporting SVG rasterization would recover ~1,077 hosts. Deferred to future improvement.
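The panic recovery wrapper mentioned in the lessons above can be as small as a deferred `recover` that turns a panic into an ordinary error. A minimal sketch, where `safeConvert` and `brittleDecode` are hypothetical names and `brittleDecode` only imitates a third-party decoder that indexes out of bounds:

```go
package main

import "fmt"

// safeConvert runs convert and reports any panic it raises as an error,
// so one malformed icon cannot crash the whole run.
func safeConvert(convert func([]byte) ([]byte, error), data []byte) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			out = nil
			err = fmt.Errorf("icon conversion panicked: %v", r)
		}
	}()
	return convert(data)
}

// brittleDecode stands in for a decoder that trusts its input: it reads
// the fourth byte without checking the length first.
func brittleDecode(data []byte) ([]byte, error) {
	_ = data[3] // panics with index out of range on short input
	return data, nil
}

func main() {
	if _, err := safeConvert(brittleDecode, []byte{0x00, 0x00}); err != nil {
		fmt.Println("skipped icon:", err)
	}
	out, err := safeConvert(brittleDecode, []byte{0x00, 0x00, 0x01, 0x00})
	fmt.Println(len(out), err)
}
```

Named return values are what let the deferred function overwrite the result after the panic; recovered icons can then be counted as convert errors instead of aborting the batch.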
---
## Future Improvements
@@ -631,3 +581,5 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- **Encoding: investigate remaining garbled titles** — Some titles still show `<60>` in output (e.g., `BERGSTRANDS BAGERI <20>...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
- **Icon download: retry transient failures** — DNS and timeout failures could benefit from a single retry. Would recover a small percentage of icons.
- **Icon download: download large link_rel icons** — Currently skipping declared sizes >64x64. Re-run with broader filter for future high-res projects.
- **Bundle gen: SVG rasterization** — ~1,077 hosts have SVG-only favicons. Could add `rsvg-convert` or a Go SVG library to rasterize these.
- **Bundle gen: smarter downscaling** — Currently nearest-neighbor to 32x32 for >128px icons. Could use bilinear/Lanczos for better quality, or preserve aspect ratio for non-square icons.