diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index 86f439e..03579d6 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -13,28 +13,33 @@ The scanning phase runs monthly (triggered by new Common Crawl releases), produc

 ```mermaid
 flowchart TD
-    subgraph "Scanning Phase (EC2 instance)"
-        A[Stage 1: Query CC-Index via DuckDB] --> B[Stage 2: Parse WARCs - Go]
-        B --> C[Stage 3: Download Icons - Go]
-        C --> D[Stage 4: Select Best Icons]
-        D --> E[Stage 5: Generate Bundles - Go]
-        E --> F[Stage 6: Build Frontend]
+    subgraph EC2["Scanning Phase (EC2 instance)"]
+        A["Stage 1: Query CC-Index via DuckDB"]
+        B["Stage 2: Parse WARCs - Go"]
+        C["Stage 3: Download Icons - Go"]
+        D["Stage 4: Select Best Icons"]
+        E["Stage 5: Generate Bundles - Go"]
+        F["Stage 6: Build Frontend"]
+        UB["Unbound - Local recursive resolver"]
+
+        A --> B --> C --> D --> E --> F
+        UB -.-> C
     end

-    subgraph "External Data"
-        CC[Common Crawl S3\nParquet Index + WARCs]
+    subgraph ExtData["External Data"]
+        CC["Common Crawl S3 - Parquet Index + WARCs"]
     end

-    subgraph "AWS Services"
-        RDS[(RDS Postgres\nhosts + icons tables)]
-        S3I[S3: everytab-icons\nRaw downloaded favicons]
-        S3S[S3: everytab-site\ntabs/*.json + index.html]
-        CF[CloudFront CDN]
+    subgraph AWS["AWS Services"]
+        RDS[("RDS Postgres - hosts + icons tables")]
+        S3I["S3: everytab-icons - Raw downloaded favicons"]
+        S3S["S3: everytab-site - tabs/*.json + index.html"]
+        CF["CloudFront CDN"]
     end

-    subgraph "Post-Scan"
-        BAK[Backup to Homelab\nRDS dump + icons sync]
-        TEAR[Teardown\nDelete RDS, icons bucket, EC2]
+    subgraph Post["Post-Scan"]
+        BAK["Backup to Homelab - RDS dump + icons sync"]
+        TEAR["Teardown - Delete RDS, icons bucket, EC2"]
     end

     CC --> A
@@ -51,11 +56,6 @@ flowchart TD
     F --> BAK
     BAK --> TEAR
-
-    subgraph "DNS"
-        UB[Unbound\nLocal recursive resolver\non EC2]
-    end
-    UB -.-> C
 ```

 **Key point:** DuckDB, Go programs, and Unbound all run on the same EC2 instance. The pipeline is sequential — one stage completes before the next begins.

@@ -114,8 +114,8 @@ All resources in **us-east-1**.
 | rel_type | TEXT | MIME type from HTML attribute (if specified) |
 | rel_sizes | TEXT | Sizes attribute from HTML (if specified) |
 | content_type | TEXT | Actual MIME type after download |
-| width | INT | Decoded pixel width |
-| height | INT | Decoded pixel height |
+| width | INT | Best usable pixel width (for ICO: largest standard size ≤64; for SVG: NULL) |
+| height | INT | Best usable pixel height (for ICO: largest standard size ≤64; for SVG: NULL) |
 | file_size | INT | Size in bytes |
 | s3_key | TEXT | Key in everytab-icons bucket (SHA-256 of content) |
 | scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
@@ -152,7 +152,7 @@ All resources in **us-east-1**.

 Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with `"icon": ""`) as long as they have a title. CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data.

-Bundle size targets ~100-150 entries (enough to fill a viewport with buffer for scrolling). Estimated ~150-300KB per bundle uncompressed, smaller after Brotli.
+Bundle size is parameterized (`ENTRIES_PER_BUNDLE`). Target: enough entries to fill a viewport plus scroll buffer. Initial estimate ~100-150 entries (~150-300KB uncompressed, smaller after Brotli). Will be tuned empirically once the frontend is built and we can measure how many tabs fill a screen.

 ## Pipeline Stages

@@ -223,14 +223,28 @@ WHERE url_path = '/'
 **Input:** `icons` table rows where `scan_state = 'unscanned'`

 **Process:**
-1. Claim batch: `UPDATE icons SET scan_state = 'in_progress' WHERE scan_state = 'unscanned' AND id IN (SELECT id FROM icons WHERE scan_state = 'unscanned' LIMIT N FOR UPDATE SKIP LOCKED) RETURNING *`
+1. Claim batch (randomized to spread load across hosts):
+   ```sql
+   UPDATE icons SET scan_state = 'in_progress'
+   WHERE id IN (
+     SELECT id FROM icons
+     WHERE scan_state = 'unscanned'
+     ORDER BY md5(id::text)  -- deterministic shuffle: spreads hosts apart
+     LIMIT N
+     FOR UPDATE SKIP LOCKED
+   ) RETURNING *;
+   ```
+   This ensures requests to the same domain aren't back-to-back. With 30M+ icons from different hosts, a random batch of 1000 almost never contains two icons from the same server.
 2. For each icon URL:
    - Make HTTP(S) GET request (standard Go HTTP client — DNS transparently goes through Unbound)
    - Enforce timeouts: 5s connect, 10s total
    - Enforce max download size: 512KB (generous for icons, but prevents abuse)
    - On success:
      - Validate magic bytes (is this actually an image?)
-     - Decode to get dimensions (width, height) — just read headers, don't fully decode
+     - Decode to get dimensions:
+       - PNG/GIF/WebP/JPEG/BMP: read image headers for width/height
+       - ICO: parse ICO header, find largest embedded size ≤64x64 at a standard dimension (16/32/48/64), store THAT size in width/height
+       - SVG: store width=NULL, height=NULL (vector, no pixel size)
      - Compute SHA-256 of content
      - Upload to S3 `everytab-icons/{sha256}` (skip if key already exists — dedup)
      - Update icons row: s3_key, content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
@@ -303,15 +317,15 @@ UPDATE hosts h SET best_icon_s3_key = (
 2. Randomize the full result set
 3. For each host with an icon (best_icon_s3_key IS NOT NULL):
    - Download from S3 `everytab-icons/{s3_key}`
-   - Decode the image (handle ICO, PNG, GIF, WebP, SVG):
-     - ICO: extract the largest embedded image at a standard size <= 64x64, decode to raster
-     - SVG: rasterize to 32x32 PNG
-     - WebP/GIF/BMP: decode to raster
-     - PNG: use as-is (re-compress if possible)
-   - Re-encode as optimized PNG (preserve original dimensions, don't upscale)
+   - Decode the image based on format:
+     - ICO: parse container, extract the image at the size recorded in width/height (the largest standard size ≤64x64). ICO can embed BMP or PNG internally — decode whichever is present.
+     - PNG: decode directly
+     - GIF/WebP/BMP/JPEG: decode to raster
+     - SVG: rasterize to 32x32 (use a Go SVG rasterizer library)
+   - Re-encode as optimized PNG at original dimensions (never upscale — a 16x16 stays 16x16)
    - Base64-encode the PNG bytes
 4. For hosts without icons: set icon to empty string
-5. Chunk into groups of N entries (~100-150, tuned to fill a viewport)
+5. Chunk into groups of `ENTRIES_PER_BUNDLE` entries (parameterized, initially ~100-150, tuned to viewport fill)
 6. Serialize each chunk as JSON, write to S3 `everytab-site/tabs/{n}.json`
 7. Record total bundle count

@@ -333,11 +347,11 @@ UPDATE hosts h SET best_icon_s3_key = (

 ### Stage 7: Backup & Teardown

-**Process (manual, with confirmation):**
+**Process (manual, with confirmation at each step):**
 1. Dump RDS database: `pg_dump` → transfer to homelab
 2. Sync icons: `aws s3 sync s3://everytab-icons/ homelab:/path/to/backup/icons/`
-3. **Confirm backups are complete and verified**
-4. Delete RDS instance (with final snapshot as safety net)
+3. **Verify backups:** confirm pg_dump restores cleanly on homelab, spot-check icon files
+4. Delete RDS instance (skip final snapshot — homelab backup is the source of truth, snapshots cost $0.095/GB-month)
 5. Delete S3 `everytab-icons` bucket
 6. Terminate EC2 instance
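
Review note on the ICO rule introduced in Stage 3 and reused in Stage 5 ("largest embedded size ≤64x64 at a standard dimension"): the selection can be made from the ICO directory alone, without decoding any image data. The Go sketch below is illustrative only, not the project's implementation. The function name `bestICOSize` is hypothetical, and it assumes that only square entries (width equal to height) count as a standard size, which the diff does not state.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// bestICOSize scans the ICO directory and returns the largest square entry
// whose side is one of 16, 32, 48 or 64 pixels. Hypothetical helper: the
// name and the square-only rule are assumptions, not taken from the diff.
func bestICOSize(data []byte) (int, error) {
	const headerLen, entryLen = 6, 16
	if len(data) < headerLen {
		return 0, errors.New("ico: header truncated")
	}
	// ICONDIR (little-endian): reserved = 0, type = 1 (icon), image count.
	if binary.LittleEndian.Uint16(data[0:2]) != 0 ||
		binary.LittleEndian.Uint16(data[2:4]) != 1 {
		return 0, errors.New("ico: not an icon file")
	}
	count := int(binary.LittleEndian.Uint16(data[4:6]))
	if len(data) < headerLen+count*entryLen {
		return 0, errors.New("ico: directory truncated")
	}

	best := 0
	for i := 0; i < count; i++ {
		entry := data[headerLen+i*entryLen:]
		// Bytes 0 and 1 of each ICONDIRENTRY hold width and height.
		// A stored value of 0 means 256 pixels, which never matches below.
		w, h := int(entry[0]), int(entry[1])
		if w != h {
			continue
		}
		switch w {
		case 16, 32, 48, 64:
			if w > best {
				best = w
			}
		}
	}
	if best == 0 {
		return 0, errors.New("ico: no standard size <= 64 found")
	}
	return best, nil
}
```

Stage 3 could store the returned value in both `width` and `height`, and Stage 5 could then look up the directory entry with that size to find the embedded BMP or PNG payload. How to treat files where this returns an error (for example, an ICO containing only a 256x256 image) is left open by the diff.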