rewrote icon selection in english rather than sql

Joe Lothan 2026-05-17 22:22:32 -04:00
parent 5a2e37ae06
commit 6cf6049698

@@ -276,39 +276,24 @@ WHERE url_path = '/'
**Tool:** SQL script
**Process:** For each host, select the best icon from its completed downloads:
**Process:** For each host, select the best icon from all its completed downloads.
```sql
UPDATE hosts h SET best_icon_s3_key = (
    SELECT i.s3_key FROM icons i
    WHERE i.host_id = h.id
      AND i.scan_state = 'completed'
    ORDER BY
        -- Prefer standard square sizes
        CASE
            WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
            WHEN i.width = i.height AND i.width <= 64 THEN 1
            WHEN i.width <= 64 AND i.height <= 64 THEN 2
            ELSE 3
        END,
        -- Among valid options, prefer larger
        i.width DESC,
        -- Prefer PNG/GIF/ICO over SVG/WebP for simpler processing
        CASE
            WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
            WHEN i.content_type IN ('image/webp') THEN 1
            WHEN i.content_type IN ('image/svg+xml') THEN 2
            ELSE 3
        END,
        -- Smaller file size as tiebreaker
        i.file_size ASC
    LIMIT 1
);
```
**Selection priority (decision flow):**
**Note on SVG/WebP:** These are downloaded and stored during scanning but are lower priority for bundle selection. Rasterizing SVG to PNG adds complexity; WebP re-encoding to PNG may increase size. If a host ONLY has SVG/WebP icons, we still use them (convert in bundle generation). But if PNG/GIF/ICO alternatives exist, prefer those.
1. Standard square sizes (32x32, 64x64, 48x48, 16x16) — ideal for tab display. Prefer larger.
2. Other square sizes ≤64px — close enough. Prefer larger.
3. Non-square but both dimensions ≤64px — acceptable. Prefer larger.
4. Everything else (180x180, 192x192, SVG with no dimensions, etc.) — last resort, will be downscaled in bundle generation.
**Stats emitted:** Hosts with icons selected, hosts without any icon, icon size distribution, format distribution of selected icons.
Within the same tier: prefer PNG/GIF/ICO over WebP over SVG, then smaller file size as tiebreaker.
Does not distinguish between `favicon_ico` and `link_rel` sources — purely based on what was actually downloaded and its dimensions/format.
Uses `DISTINCT ON (host_id)` for efficient single-pass selection. See `pipeline/04_best_icon/select.sql`.
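The decision flow above can be read as a single sort key. The following is a minimal Python sketch of that reading, not the contents of `pipeline/04_best_icon/select.sql`. The names `Icon`, `tier`, `sort_key`, and `select_best_icon` are hypothetical. It assumes the ordering is tier first, then larger width, then format, then smaller file size (the order used by the earlier SQL), and that icons with no recorded dimensions fall into the last tier and sort after sized icons there.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Standard square sizes named in the selection priority list.
STANDARD_SIZES = {16, 32, 48, 64}

# Lower rank is preferred: PNG/GIF/ICO, then WebP, then SVG, then anything else.
FORMAT_RANK = {
    "image/png": 0,
    "image/gif": 0,
    "image/x-icon": 0,
    "image/vnd.microsoft.icon": 0,
    "image/webp": 1,
    "image/svg+xml": 2,
}


@dataclass(frozen=True)
class Icon:
    """Hypothetical stand-in for one completed icon download."""
    s3_key: str
    width: Optional[int]   # None when dimensions are unknown (e.g. some SVGs)
    height: Optional[int]
    content_type: str
    file_size: int


def tier(icon: Icon) -> int:
    """Map an icon to tiers 0..3, mirroring items 1..4 of the decision flow."""
    w, h = icon.width, icon.height
    if w is None or h is None:
        return 3  # assumption: unknown dimensions are a last resort
    if w == h and w in STANDARD_SIZES:
        return 0
    if w == h and w <= 64:
        return 1
    if w <= 64 and h <= 64:
        return 2
    return 3


def sort_key(icon: Icon) -> tuple:
    """Smaller tuples win: tier, larger width, better format, smaller file."""
    return (
        tier(icon),
        -(icon.width or 0),  # assumption: missing width sorts last in its tier
        FORMAT_RANK.get(icon.content_type, 3),
        icon.file_size,
    )


def select_best_icon(icons: Iterable[Icon]) -> Optional[Icon]:
    """Return the preferred icon for one host, or None if it has none."""
    return min(icons, key=sort_key, default=None)
```

Calling `select_best_icon` once per host with that host's completed downloads plays the role that `DISTINCT ON (host_id)` plays in the SQL version: one winner per host under a fixed ordering.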
**Note on SVG/WebP:** Lower priority because rasterizing SVG adds complexity and WebP-to-PNG re-encoding may increase size. Only selected when no raster alternatives exist.
**Stats emitted:** Hosts with icons selected, hosts without any icon.
### Stage 5: Bundle Generation