added icon downloader

2026-05-17 22:09:03 -04:00 · 5a2e37ae06
commit 5a2e37ae06
parent 8b5693b5c6
10 changed files with 829 additions and 68 deletions
--- a/PLAN.md
+++ b/PLAN.md
@@ -150,78 +150,32 @@ Binary: `pipeline/02_warc_parse/` (5 files: main.go, warc.go, parser.go, process

 ---

-## Phase 3: Icon Download (Stage 3)
+## Phase 3: Icon Download (Stage 3) [COMPLETED]

-### Step 3.1: Icon Downloader Go Program
+### Steps 3.1-3.3 [COMPLETED]

-```
-pipeline/03_icon_download/
-├── main.go          # Entry point, CLI flags, worker pool
-├── downloader.go    # HTTP fetch with timeouts, size limits
-├── decoder.go       # Image validation + dimension extraction
-├── s3.go            # Upload to everytab-icons bucket
-└── db.go            # Claim work, update results
-```
+Binary: `pipeline/03_icon_download/` (6 files: main.go, download.go, image.go, s3.go, db.go, log.go)

-CLI flags:
- `--db` connection string
- `--s3-bucket` (default `everytab-icons`)
- `--concurrency` (default 1000, tunable)
- `--batch-size` (default 500)
- `--timeout` (default 10s)
- `--max-size` (default 512KB)
- `--dry-run` (fetch and validate but don't upload to S3 or update DB)
- `--limit` (process at most N icons)
+**Architecture:**
+- Channel-based work distribution: producer goroutine claims batches, N worker goroutines consume from buffered channel (no worker starvation)
+- Shared `http.Transport` for connection pooling / TLS session reuse
+- Content-addressed S3 storage (SHA-256 hash as key, dedup via HeadObject before upload)
+- Magic byte validation (PNG, GIF, JPEG, ICO, BMP, WebP, SVG)
+- ICO directory parsing for dimensions (picks largest ≤64x64)
+- Filters to eligible icons only: `favicon_ico` rows, plus `link_rel` rows with no declared size or a declared size ≤64x64
+- md5(id) shuffle in claim query to spread requests across hosts
+- Panic recovery per worker, DB errors tracked and logged

-Dependencies:
- `github.com/jackc/pgx/v5` — Postgres
- `github.com/aws/aws-sdk-go-v2` — S3 uploads
- `github.com/schollz/progressbar/v3` — Progress bar
- Standard library `image` + `image/png`, `image/gif`, `image/jpeg` for decoding dimensions
- `golang.org/x/image/webp` — WebP decoding
- ICO parsing: write a minimal decoder (ICO format is simple — 6-byte header + directory entries pointing to BMP/PNG data) or find a maintained library at implementation time
+**CLI:** `./icon_download --db URL [--s3-bucket NAME] [--concurrency N] [--batch-size N] [--timeout D] [--max-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]`

-### Step 3.2: Work Claiming + Download Logic
-
-Implement:
-1. Claim batch with randomized order (md5 shuffle, FOR UPDATE SKIP LOCKED)
-2. For each icon URL:
-   - HTTP GET with timeouts (5s dial, 10s total)
-   - Read up to max-size bytes, abort if exceeded
-   - Validate magic bytes (PNG: `\x89PNG`, GIF: `GIF8`, ICO: `\x00\x00\x01\x00`, etc.)
-   - Determine actual content type from magic bytes (don't trust HTTP Content-Type)
-   - Decode dimensions:
-     - PNG/GIF/JPEG/WebP/BMP: read image header (Go `image.DecodeConfig`)
-     - ICO: parse directory entries, find largest at standard size ≤64x64
-     - SVG: set width=NULL, height=NULL
-   - Compute SHA-256 of full content
-   - Check if S3 key exists (HEAD request); if yes, skip upload (dedup)
-   - Upload to S3 if new
-3. Update icons row with results (or error)
-
-**Dry-run test:** `--limit 200 --dry-run` — prints what it would do for 200 icons. Check URLs, detected types, dimensions.
-
-**Done when:** Can download, validate, and upload icons for a small batch.
-
-### Step 3.3: Full 100K Icon Run
-
-Run against all icons in the database (likely 150K-300K icon rows for 100K hosts).
-
-Monitor:
- icons/sec throughput
- Error breakdown (DNS failures, timeouts, HTTP errors, invalid images)
- S3 dedup hit rate
- Memory usage (adjust concurrency if needed)
-
-**Validation:**
- `SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;` — expect mostly completed, some failed
- `SELECT error, COUNT(*) FROM icons WHERE scan_state = 'failed' GROUP BY error ORDER BY count DESC LIMIT 20;` — understand failure modes
- `aws s3 ls s3://everytab-icons/ | wc -l` — confirm icons in S3
- Spot-check: download a few icons from S3, open them, verify they're valid images
-
-**Stats:** `stats/03_icon_download.json`
-
-**Done when:** Icon download complete for 100K dev set, error rate understood, S3 populated.
+**Result (100K hosts, ~224K eligible icons):**
+- Duration: 10m36s (351 icons/sec)
+- Completed: 156,214 (70%)
+- Failed: 67,459 (30% — mostly HTTP 404s from stale crawl data)
+- Dedup hits: 55,771 (25% — shared Wix/WordPress/hosted platform favicons)
+- Downloaded: 1.9GB
+- DNS errors: 1,668 | Timeouts: 2,129 | HTTP errors: 47,565 | Invalid: 11,803 | Too large: 777
+- DB errors: 0 | Panics: 0

 ---

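Note on the Architecture list in the hunk above: the magic-byte validation and ICO directory parsing can be sketched roughly as follows. This is a minimal illustration, not the code in `image.go`; the names `sniffType` and `icoLargest` are hypothetical, and SVG is left out because it is text-based and needs a different check. The ICO layout used here (6-byte header, 16-byte directory entries, a width or height byte of 0 meaning 256) follows the standard format.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// sniffType returns a short format name based on the leading magic bytes,
// or "" if nothing matches (for example an HTML error page).
// Hypothetical helper for illustration only.
func sniffType(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")):
		return "png"
	case bytes.HasPrefix(b, []byte("GIF87a")), bytes.HasPrefix(b, []byte("GIF89a")):
		return "gif"
	case bytes.HasPrefix(b, []byte("\xff\xd8\xff")):
		return "jpeg"
	case bytes.HasPrefix(b, []byte{0x00, 0x00, 0x01, 0x00}):
		return "ico"
	case bytes.HasPrefix(b, []byte("BM")):
		return "bmp"
	case len(b) >= 12 && string(b[0:4]) == "RIFF" && string(b[8:12]) == "WEBP":
		return "webp"
	}
	return ""
}

// icoLargest walks the ICO directory and returns the dimensions of the
// largest entry whose width and height are both <= limit.
func icoLargest(b []byte, limit int) (w, h int, err error) {
	if len(b) < 6 || sniffType(b) != "ico" {
		return 0, 0, errors.New("not an ICO file")
	}
	count := int(binary.LittleEndian.Uint16(b[4:6]))
	if len(b) < 6+16*count {
		return 0, 0, errors.New("truncated ICO directory")
	}
	for i := 0; i < count; i++ {
		entry := b[6+16*i:]
		ew, eh := int(entry[0]), int(entry[1])
		if ew == 0 { // a zero byte encodes 256
			ew = 256
		}
		if eh == 0 {
			eh = 256
		}
		if ew <= limit && eh <= limit && ew*eh > w*h {
			w, h = ew, eh
		}
	}
	if w == 0 {
		return 0, 0, errors.New("no entry within size limit")
	}
	return w, h, nil
}

func main() {
	// Build a synthetic ICO directory with 16x16, 48x48 and 128x128 entries.
	ico := []byte{0, 0, 1, 0, 3, 0}
	for _, d := range []byte{16, 48, 128} {
		entry := make([]byte, 16)
		entry[0], entry[1] = d, d
		ico = append(ico, entry...)
	}
	w, h, err := icoLargest(ico, 64)
	fmt.Println(sniffType(ico), w, h, err)
	fmt.Println(sniffType([]byte("<!DOCTYPE html>")) == "")
}
```

Sniffing the body rather than trusting the HTTP Content-Type is what lets HTML error pages served at `/favicon.ico` be classified as invalid.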
@@ -650,6 +604,22 @@ On completion, each program prints a summary line and writes its stats JSON (wit
 - Increasing concurrency from 100 to 500 didn't improve throughput (~300 hosts/sec either way). The bottleneck is likely Postgres write latency or S3 per-connection bandwidth, not parallelism. Could investigate batch inserts for the full run.
 - Progress bars and per-item log lines don't mix well in terminals. Pick one or write progress to a separate channel (file, stderr).

+### Phase 3 — Completed 2026-05-18
+
+**Changes from original plan:**
+- Filtered eligible icons before downloading: skip link_rel icons with declared size >64x64 (apple-touch-icon bloat). Reduced download count from ~302K to ~224K.
+- Channel-based worker pool instead of semaphore pattern — producer goroutine feeds work channel, N workers consume. No starvation between batch claims.
+- Shared http.Transport for connection pooling (marginal benefit since hosts are unique, but reduces GC pressure).
+- No progress bar — same approach as Phase 2 (log lines + summary).
+- User-Agent set to `EveryTabBot/1.0` with link to `everytab.site/bot` for bot identification.
+
+**Lessons learned:**
+- 70% icon download success rate is expected — most failures are 404s from domains/pages that changed since the crawl. This is acceptable loss.
+- 25% dedup rate — many hosted platforms (Wix, WordPress.com, Squarespace) serve identical default favicons. Content-addressed S3 storage handles this efficiently.
+- `data.commoncrawl.org` rate-limits HTTPS but S3 does not — same pattern as WARC parsing. Use S3 API for all CC access.
+- Favicon download is I/O bound (network latency to diverse hosts worldwide). Concurrency helps up to a point, then the long tail of slow/dead servers dominates. 351 icons/sec at 200 concurrency.
+- Invalid image detection (magic bytes) catches ~5% of "successful" downloads that are actually HTML error pages served at `/favicon.ico`.
+
 ---

 ## Future Improvements
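
Note on the channel-based worker pool described in the hunk above: a minimal sketch of the pattern, assuming a `claim` callback that returns an empty batch when no work is left. The names `job`, `runPool`, `claim`, and `handle` are hypothetical and do not come from `main.go`; the real claim step runs a database query, which is replaced here by an in-memory stub.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// job is a hypothetical unit of work (one icon row).
type job struct {
	id  int
	url string
}

// runPool starts one producer that claims batches and feeds a buffered
// channel, plus `workers` consumers. Each job runs under its own recover,
// so a panic in handle is counted instead of killing the worker.
func runPool(workers, buffer int, claim func() []job, handle func(job)) (int64, int64) {
	jobs := make(chan job, buffer)
	var done, panics atomic.Int64

	go func() {
		defer close(jobs)
		for {
			batch := claim()
			if len(batch) == 0 {
				return // nothing left to claim
			}
			for _, j := range batch {
				jobs <- j
			}
		}
	}()

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				func() {
					defer func() {
						if r := recover(); r != nil {
							panics.Add(1)
						}
					}()
					handle(j)
					done.Add(1)
				}()
			}
		}()
	}
	wg.Wait()
	return done.Load(), panics.Load()
}

func main() {
	next := 0
	claim := func() []job { // stub: three batches of four jobs
		if next >= 12 {
			return nil
		}
		batch := make([]job, 0, 4)
		for i := 0; i < 4; i++ {
			batch = append(batch, job{id: next, url: fmt.Sprintf("https://example.com/%d/favicon.ico", next)})
			next++
		}
		return batch
	}
	handle := func(j job) {
		if j.id == 5 {
			panic("simulated decoder bug")
		}
	}
	done, panics := runPool(3, 8, claim, handle)
	fmt.Printf("done=%d panics=%d\n", done, panics)
}
```

Because the producer keeps the channel filled while workers are still busy, no worker sits idle waiting for the slowest item of the previous batch, which is the starvation the semaphore pattern was prone to.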
@@ -659,3 +629,5 @@ On completion, each program prints a summary line and writes its stats JSON (wit
 - **WARC parser: investigate throughput ceiling** — 300 hosts/sec at both 100 and 500 concurrency suggests a bottleneck. Profile to determine if it's S3 response latency, Postgres writes, or something else. For the full 30M run this determines wall-clock time (~28 hours at current rate).
 - **CC-Index query: c5.2xlarge for full run** — 8GB is tight with 6.4GB usage + swap. 16GB instance for the 30M-host full run.
 - **Encoding: investigate remaining garbled titles** — Some titles still show `<60>` in output (e.g., `BERGSTRANDS BAGERI <20>...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
+- **Icon download: retry transient failures** — DNS and timeout failures could benefit from a single retry. Would recover a small percentage of icons.
+- **Icon download: download large link_rel icons** — Currently skipping declared sizes >64x64. Re-run with broader filter for future high-res projects.