diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index e362553..2a09903 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -276,39 +276,24 @@ WHERE url_path = '/'
 
 **Tool:** SQL script
 
-**Process:** For each host, select the best icon from its completed downloads:
+**Process:** For each host, select the best icon from all its completed downloads.
 
-```sql
-UPDATE hosts h SET best_icon_s3_key = (
-  SELECT i.s3_key FROM icons i
-  WHERE i.host_id = h.id
-    AND i.scan_state = 'completed'
-  ORDER BY
-    -- Prefer standard square sizes
-    CASE
-      WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
-      WHEN i.width = i.height AND i.width <= 64 THEN 1
-      WHEN i.width <= 64 AND i.height <= 64 THEN 2
-      ELSE 3
-    END,
-    -- Among valid options, prefer larger
-    i.width DESC,
-    -- Prefer PNG/GIF/ICO over SVG/WebP for simpler processing
-    CASE
-      WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
-      WHEN i.content_type IN ('image/webp') THEN 1
-      WHEN i.content_type IN ('image/svg+xml') THEN 2
-      ELSE 3
-    END,
-    -- Smaller file size as tiebreaker
-    i.file_size ASC
-  LIMIT 1
-);
-```
+**Selection priority (decision flow):**
 
-**Note on SVG/WebP:** These are downloaded and stored during scanning but are lower priority for bundle selection. Rasterizing SVG to PNG adds complexity; WebP re-encoding to PNG may increase size. If a host ONLY has SVG/WebP icons, we still use them (convert in bundle generation). But if PNG/GIF/ICO alternatives exist, prefer those.
+1. Standard square sizes (32x32, 64x64, 48x48, 16x16) — ideal for tab display. Prefer larger.
+2. Other square sizes ≤64px — close enough. Prefer larger.
+3. Non-square but both dimensions ≤64px — acceptable. Prefer larger.
+4. Everything else (180x180, 192x192, SVG with no dimensions, etc.) — last resort, will be downscaled in bundle generation.
 
-**Stats emitted:** Hosts with icons selected, hosts without any icon, icon size distribution, format distribution of selected icons.
+Within the same tier: prefer PNG/GIF/ICO over WebP over SVG, then smaller file size as tiebreaker.
+
+Does not distinguish between `favicon_ico` and `link_rel` sources — purely based on what was actually downloaded and its dimensions/format.
+
+Uses `DISTINCT ON (host_id)` for efficient single-pass selection. See `pipeline/04_best_icon/select.sql`.
+
+**Note on SVG/WebP:** Lower priority because rasterizing SVG adds complexity and WebP-to-PNG re-encoding may increase size. Only selected when no raster alternatives exist.
+
+**Stats emitted:** Hosts with icons selected, hosts without any icon.
 
 ### Stage 5: Bundle Generation
 
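
Review note on the Stage 4 selection priority in this change: the new prose replaces the inline SQL, so the ordering can be hard to check at a glance. The sketch below models it as a Python sort key. It is not the contents of `pipeline/04_best_icon/select.sql`, which this diff does not show. The `Icon` class, `size_tier`, `sort_key`, and `best_icon` are hypothetical names, and the ordering (tier, then larger width, then format, then smaller file size) is an assumption carried over from the removed `ORDER BY`. Treating missing dimensions as tier 4 with width 0 is also an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Standard square sizes named in tier 1 of the decision flow.
STANDARD_SIZES = {16, 32, 48, 64}

# Format preference within a tier: PNG/GIF/ICO, then WebP, then SVG.
# Unknown content types rank last (assumption, mirrors the removed SQL's ELSE 3).
FORMAT_RANK = {
    "image/png": 0,
    "image/gif": 0,
    "image/x-icon": 0,
    "image/vnd.microsoft.icon": 0,
    "image/webp": 1,
    "image/svg+xml": 2,
}


@dataclass(frozen=True)
class Icon:
    """Hypothetical stand-in for one completed download of a host."""
    s3_key: str
    width: Optional[int]   # None when dimensions are unknown (e.g. some SVGs)
    height: Optional[int]
    content_type: str
    file_size: int


def size_tier(icon: Icon) -> int:
    """Return 0..3 for tiers 1..4 of the decision flow (lower is better)."""
    w, h = icon.width, icon.height
    if w is None or h is None:
        return 3  # assumption: unknown dimensions fall into "everything else"
    if w == h and w in STANDARD_SIZES:
        return 0
    if w == h and w <= 64:
        return 1
    if w <= 64 and h <= 64:
        return 2
    return 3


def sort_key(icon: Icon) -> tuple:
    """Tier first, then larger width, then format, then smaller file size."""
    return (
        size_tier(icon),
        -(icon.width or 0),
        FORMAT_RANK.get(icon.content_type, 3),
        icon.file_size,
    )


def best_icon(icons: Iterable[Icon]) -> Optional[Icon]:
    """Pick the best icon for one host, or None if it has no downloads."""
    return min(icons, key=sort_key, default=None)


if __name__ == "__main__":
    candidates = [
        Icon("host1/a.png", 16, 16, "image/png", 500),
        Icon("host1/b.webp", 32, 32, "image/webp", 900),
        Icon("host1/c.png", 180, 180, "image/png", 4000),
    ]
    print(best_icon(candidates))
```

Under this assumed ordering, a 32x32 WebP outranks a 16x16 PNG, because size is compared before format within a tier. If the intent of "Only selected when no raster alternatives exist" is that any PNG/GIF/ICO should win first, the format rank would need to move ahead of the size comparison.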