updated PLAN.md for phase 4
parent f89883e745
commit 771f5d76ab
1 changed file with 41 additions and 89 deletions
PLAN.md
@@ -179,104 +179,37 @@ Binary: `pipeline/03_icon_download/` (6 files: main.go, download.go, image.go, s
---

## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5)
## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5) [COMPLETED]

### Step 4.1: Best Icon Selection SQL
### Step 4.1: Best Icon Selection SQL [COMPLETED]

Write `pipeline/04_best_icon/select.sql`:
Script: `pipeline/04_best_icon/select.sql`

```sql
UPDATE hosts h SET best_icon_s3_key = sub.s3_key
FROM (
    -- DISTINCT ON keeps the first row per host after sorting.
    SELECT DISTINCT ON (i.host_id) i.host_id, i.s3_key
    FROM icons i
    WHERE i.scan_state = 'completed'
    ORDER BY i.host_id,
        -- Size tier: standard square sizes, other squares up to 64px,
        -- anything within 64x64, then everything else.
        CASE
            WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
            WHEN i.width = i.height AND i.width <= 64 THEN 1
            WHEN i.width IS NOT NULL AND i.width <= 64 AND i.height <= 64 THEN 2
            ELSE 3
        END,
        -- Within a tier, prefer the larger width.
        COALESCE(i.width, 0) DESC,
        -- Format preference: PNG/GIF/ICO, then WebP, then SVG, then others.
        CASE
            WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
            WHEN i.content_type = 'image/webp' THEN 1
            WHEN i.content_type = 'image/svg+xml' THEN 2
            ELSE 3
        END,
        -- Final tie-breaker: smaller file.
        i.file_size ASC
) sub
WHERE h.id = sub.host_id;
```
Selects the best icon per host using `DISTINCT ON` with priority ordering. Excludes SVGs (can't rasterize) and ≤2x2 icons (tracking pixels). See ARCHITECTURE.md for the full decision flow.
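
The size tiers in the first `CASE` of the `ORDER BY` can be read as a small ranking function. The Go sketch below mirrors that expression purely as an illustration. `sizeTier` and its `known` flag (standing in for NULL dimensions) are hypothetical names, not part of `select.sql`, and the sketch does not model the SVG or ≤2x2 exclusions.

```go
package main

import "fmt"

// sizeTier mirrors the size CASE expression: a lower value sorts first.
// known is false when width or height is NULL in the database (an assumed
// representation; the real schema may differ).
func sizeTier(width, height int, known bool) int {
	if !known {
		return 3 // NULL dimensions match no WHEN branch and fall to ELSE
	}
	square := width == height
	switch {
	case square && (width == 64 || width == 48 || width == 32 || width == 16):
		return 0 // standard square favicon sizes
	case square && width <= 64:
		return 1 // other squares up to 64px
	case width <= 64 && height <= 64:
		return 2 // non-square but within 64x64
	default:
		return 3 // larger than 64px in some dimension
	}
}

func main() {
	fmt.Println(sizeTier(32, 32, true))   // 0
	fmt.Println(sizeTier(24, 24, true))   // 1
	fmt.Println(sizeTier(64, 32, true))   // 2
	fmt.Println(sizeTier(256, 256, true)) // 3
}
```

Because `DISTINCT ON` keeps the first row per host, the tier only decides which group wins; the later sort keys (width, format, file size) break ties inside a tier.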

**Validation:**
- `SELECT COUNT(*) FROM hosts WHERE best_icon_s3_key IS NOT NULL;` — expect 60-80% of hosts
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL AND best_icon_s3_key IS NULL;` — hosts with title but no icon (will still be in bundles)
- Spot-check: for a few hosts, verify the selected icon is reasonable (correct size, valid)
**Result:** 70,366 hosts got an icon (72%), 23,018 have title but no icon.

**Stats:** `stats/04_best_icon.json`
### Steps 4.2-4.4: Bundle Generator [COMPLETED]

**Done when:** best_icon_s3_key populated for hosts that have valid icons.
Binary: `pipeline/05_bundle_gen/` (6 files: main.go, bundle.go, convert.go, db.go, s3.go, log.go)

### Step 4.2: Bundle Generator Go Program
**Architecture:**
- Queries all hosts with titles (randomized), concurrently downloads best icon from S3 icons bucket
- Uses `github.com/biessek/golang-ico` for ICO decoding (handles all bit depths including palette-based 1/4/8bpp)
- `image.Decode` handles PNG/GIF/JPEG/WebP/BMP/ICO via registered decoders. SVGs excluded.
- Icons >128px downscaled to 32x32 (nearest-neighbor). Icons ≤128px kept as-is.
- Re-encodes all icons as PNG, base64-encoded inline in bundle JSON.
- Panic recovery per icon conversion (malformed ICO files in the library)
- Concurrent S3 downloads with configurable concurrency (default 50)
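
The bounded-concurrency download in the last bullet could look roughly like the following. This is a minimal sketch using a buffered channel as a semaphore; `fetchAll` and the `fetch` callback are hypothetical stand-ins for whatever the real program uses to read from S3, and nothing here is taken from `pipeline/05_bundle_gen/`.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAll calls fetch for every key with at most limit calls in flight.
// Results are stored by index, so output order matches input order.
func fetchAll(keys []string, limit int, fetch func(string) ([]byte, error)) ([][]byte, []error) {
	data := make([][]byte, len(keys))
	errs := make([]error, len(keys))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for i, key := range keys {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit downloads are already running
		go func(i int, key string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			data[i], errs[i] = fetch(key)
		}(i, key)
	}
	wg.Wait()
	return data, errs
}

func main() {
	keys := []string{"a", "b", "c"}
	// The fake fetch below stands in for an S3 GetObject call.
	data, _ := fetchAll(keys, 2, func(k string) ([]byte, error) {
		return []byte("icon:" + k), nil
	})
	fmt.Println(len(data), string(data[0]))
}
```

Each goroutine writes only to its own slice index, so no mutex is needed around `data` and `errs`.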

```
pipeline/05_bundle_gen/
├── main.go      # Entry point, CLI flags
├── db.go        # Query hosts + icon keys
├── convert.go   # Icon format conversion → PNG
├── bundle.go    # Chunk + serialize JSON
└── s3.go        # Upload bundles to everytab-site
```
**CLI:** `./bundle_gen --db URL [--icons-bucket NAME] [--site-bucket NAME] [--entries-per-bundle N] [--concurrency N] [--limit N] [--dry-run] [--output-dir DIR] [--log-file PATH] [--log-errors-only]`

CLI flags:
- `--db` connection string
- `--icons-bucket` (default `everytab-icons`)
- `--site-bucket` (default `everytab-site`)
- `--entries-per-bundle` (tunable, start at 120)
- `--dry-run` (generate bundles to local disk, don't upload)
- `--limit` (only process N hosts, for testing)

### Step 4.3: Icon Conversion Logic

Implement format conversion to PNG:
1. Download icon from S3 by key
2. Detect format from magic bytes
3. Decode:
   - PNG: decode directly
   - ICO: parse container, extract image at recorded width/height, decode BMP or PNG within
   - GIF/JPEG/BMP/WebP: decode to RGBA
   - SVG: rasterize to 32x32 (use a Go SVG library, or shell out to `rsvg-convert` if simpler)
4. Re-encode as PNG (optimized, don't upscale)
5. Base64-encode
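
Step 2 of the list above (detect format from magic bytes) can be sketched as follows. The signatures are the well-known file headers for each format; `detectFormat` is a hypothetical helper name, and the real program may sniff formats differently. SVG is text-based and has no fixed signature, so this sketch reports it as `unknown`.

```go
package main

import (
	"bytes"
	"fmt"
)

// detectFormat inspects the leading bytes of a file and names its format.
func detectFormat(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")):
		return "png"
	case bytes.HasPrefix(b, []byte("GIF87a")), bytes.HasPrefix(b, []byte("GIF89a")):
		return "gif"
	case bytes.HasPrefix(b, []byte{0xFF, 0xD8, 0xFF}):
		return "jpeg"
	case bytes.HasPrefix(b, []byte{0x00, 0x00, 0x01, 0x00}):
		return "ico"
	case len(b) >= 12 && string(b[0:4]) == "RIFF" && string(b[8:12]) == "WEBP":
		return "webp"
	case bytes.HasPrefix(b, []byte("BM")):
		return "bmp"
	default:
		return "unknown" // includes SVG and HTML error pages
	}
}

func main() {
	fmt.Println(detectFormat([]byte("\x89PNG\r\n\x1a\n....")))  // png
	fmt.Println(detectFormat([]byte("<!DOCTYPE html><html>"))) // unknown
}
```

Checking bytes rather than trusting `Content-Type` is what lets a downloader reject HTML error pages served at `/favicon.ico`.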

**Test:** Convert 50 icons of mixed formats manually, verify output PNGs look correct.

### Step 4.4: Bundle Assembly + Upload

Implement:
1. Query all hosts WHERE html_title IS NOT NULL, randomize (ORDER BY random())
2. For each host: fetch + convert its icon (or set empty string if no icon)
3. Assemble entries into chunks of `ENTRIES_PER_BUNDLE`
4. Serialize each chunk as JSON (`tabs/{n}.json`)
5. Upload to S3 `everytab-site/tabs/`
6. Record total bundle count
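
Steps 3 and 4 above amount to slicing the entry list into fixed-size chunks and marshalling each one. The sketch below assumes an `Entry` struct whose JSON keys match the field names listed in this plan (host, title, icon, icon_w, icon_h, iframe_ok); the Go types, the zero-based `{n}`, and the choice of a top-level JSON array are assumptions made for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entry is a hypothetical shape for one bundle item.
type Entry struct {
	Host     string `json:"host"`
	Title    string `json:"title"`
	Icon     string `json:"icon"` // base64 PNG, or "" when the host has no icon
	IconW    int    `json:"icon_w"`
	IconH    int    `json:"icon_h"`
	IframeOK bool   `json:"iframe_ok"`
}

// chunk splits entries into consecutive slices of at most size items.
// size must be positive.
func chunk(entries []Entry, size int) [][]Entry {
	var out [][]Entry
	for start := 0; start < len(entries); start += size {
		end := start + size
		if end > len(entries) {
			end = len(entries) // last bundle may be partial
		}
		out = append(out, entries[start:end])
	}
	return out
}

// bundleKey builds the object key for bundle n (zero-based here by assumption).
func bundleKey(n int) string { return fmt.Sprintf("tabs/%d.json", n) }

func main() {
	entries := make([]Entry, 250)
	for i := range entries {
		entries[i] = Entry{Host: fmt.Sprintf("host%d.example", i), Title: "Example"}
	}
	for n, c := range chunk(entries, 120) {
		body, err := json.Marshal(c)
		if err != nil {
			panic(err)
		}
		fmt.Println(bundleKey(n), len(c), "entries,", len(body), "bytes")
	}
}
```

With 250 entries and a bundle size of 120 this yields three bundles of 120, 120, and 10 entries, which is the same "last bundle partial" behaviour the plan describes.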

**Dry-run:** Generate bundles to local disk, inspect a few:
- Valid JSON
- Icons render in browser (paste a data:image/png;base64,... URI)
- Entries have host, title, icon, icon_w, icon_h, iframe_ok

**Validation:**
- Bundle files exist in S3
- `aws s3 ls s3://everytab-site/tabs/ | wc -l` matches expected count
- Random bundle can be fetched and parsed as JSON
- Total hosts across all bundles = count of hosts with titles

**Stats:** `stats/05_bundle_gen.json`

**Done when:** All bundles uploaded to S3, JSON is valid, icons render.
**Result (93K hosts with titles, 70K with icons):**
- Duration: 1m30s
- Bundles created: 779 (120 entries each, last bundle partial)
- Total size: 165MB (avg 216KB per bundle)
- Convert errors: 1,263 (1,077 SVGs + 186 other — panics, truncated files, corrupt GIFs)
- S3: 779 JSON files in `everytab-site/tabs/`

---

@@ -620,6 +553,23 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- Favicon download is I/O bound (network latency to diverse hosts worldwide). Concurrency helps up to a point, then the long tail of slow/dead servers dominates. 351 icons/sec at 200 concurrency.
- Invalid image detection (magic bytes) catches ~5% of "successful" downloads that are actually HTML error pages served at `/favicon.ico`.

### Phase 4 — Completed 2026-05-18

**Changes from original plan:**
- Used `github.com/biessek/golang-ico` instead of hand-rolled ICO decoder. Handles all bit depths (1/4/8/24/32bpp) correctly. Eliminated ~20 ICO decode errors from the hand-rolled version.
- SVGs excluded from best-icon selection (can't rasterize without external deps). SVG-only hosts show up with no icon instead of failing at conversion time.
- Added ≤2x2 pixel exclusion from best-icon selection (tracking pixels / garbage favicons).
- Icons >128px downscaled to 32x32 during bundle generation. Icons ≤128px (including 80x80) kept as-is — browser CSS handles display scaling.
- Added panic recovery around icon conversion (the ICO library panics on some malformed files).
- Added concurrency for S3 icon downloads during bundle generation (was single-threaded, now 50 concurrent).
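
The downscaling rule in the list above (icons over 128px become 32x32 by nearest-neighbor, smaller icons pass through) can be sketched with the standard `image` package alone. `outputSize` and `resizeNearest` are hypothetical names, and reading ">128px" as "either dimension exceeds 128" is an assumption.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
)

const (
	maxKeep = 128 // icons up to this edge length are kept as-is
	target  = 32  // edge length of downscaled icons
)

// outputSize returns the dimensions an icon should have after the rule is applied.
func outputSize(w, h int) (int, int) {
	if w > maxKeep || h > maxKeep {
		return target, target
	}
	return w, h
}

// resizeNearest copies, for each destination pixel, the source pixel at the
// same relative position (nearest-neighbor sampling, no interpolation).
func resizeNearest(src image.Image, w, h int) *image.RGBA {
	sb := src.Bounds()
	dst := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		sy := sb.Min.Y + y*sb.Dy()/h
		for x := 0; x < w; x++ {
			sx := sb.Min.X + x*sb.Dx()/w
			dst.Set(x, y, src.At(sx, sy))
		}
	}
	return dst
}

func main() {
	src := image.NewRGBA(image.Rect(0, 0, 256, 256))
	src.Set(128, 128, color.RGBA{R: 255, A: 255}) // one marked pixel
	w, h := outputSize(src.Bounds().Dx(), src.Bounds().Dy())
	var out image.Image = src
	if w != src.Bounds().Dx() || h != src.Bounds().Dy() {
		out = resizeNearest(src, w, h)
	}
	fmt.Println(out.Bounds().Dx(), out.Bounds().Dy())
}
```

Forcing a square 32x32 output distorts non-square sources, and nearest-neighbor sampling discards most source pixels, which is why the quality trade-off matters mainly for large, detailed icons.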

**Lessons learned:**
- Many hosts (28%) have no usable favicon at all — their /favicon.ico returns HTML or 404, and they have no link rel="icon". These appear in bundles title-only.
- The golang-ico library panics on certain malformed ICO files (index out of bounds). Third-party decoders need panic recovery wrappers.
- 80x80 icons are overwhelmingly one single default favicon shared by a hosting platform (~4,276 sites share one hash). Content-addressed storage handles this.
- Bundle sizes are very heterogeneous (39KB to 198KB) due to icon size variance. Average 216KB is well within our target.
- SVG favicons are ~3.5% of downloaded icons (5,128 out of 156K). Supporting SVG rasterization would recover ~1,077 hosts. Deferred to future improvement.
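
A panic recovery wrapper of the kind mentioned above is a few lines of Go. The sketch below is illustrative only: `safeConvert` and the `convert` callback are hypothetical names, and the deliberately broken decoder stands in for a third-party library hitting an out-of-bounds index.

```go
package main

import "fmt"

// safeConvert runs convert and turns any panic into an ordinary error, so
// one malformed icon cannot take down the whole run.
func safeConvert(convert func([]byte) ([]byte, error), raw []byte) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			out, err = nil, fmt.Errorf("icon conversion panicked: %v", r)
		}
	}()
	return convert(raw)
}

func main() {
	// A fake decoder that indexes past the end of short input.
	broken := func(raw []byte) ([]byte, error) {
		return []byte{raw[10]}, nil
	}
	_, err := safeConvert(broken, []byte{0x00, 0x00})
	fmt.Println(err != nil) // true
}
```

The named return values are what allow the deferred function to replace the result after the panic has unwound the call.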

---

## Future Improvements

@@ -631,3 +581,5 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- **Encoding: investigate remaining garbled titles** — Some titles still show `<60>` in output (e.g., `BERGSTRANDS BAGERI <20>...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
- **Icon download: retry transient failures** — DNS and timeout failures could benefit from a single retry. Would recover a small percentage of icons.
- **Icon download: download large link_rel icons** — Currently skipping declared sizes >64x64. Re-run with broader filter for future high-res projects.
- **Bundle gen: SVG rasterization** — ~1,077 hosts have SVG-only favicons. Could add `rsvg-convert` or a Go SVG library to rasterize these.
- **Bundle gen: smarter downscaling** — Currently nearest-neighbor to 32x32 for >128px icons. Could use bilinear/Lanczos for better quality, or preserve aspect ratio for non-square icons.