added icon downloader

Joe Lothan 2026-05-17 22:09:03 -04:00
parent 8b5693b5c6
commit 5a2e37ae06
10 changed files with 829 additions and 68 deletions

106
PLAN.md

@@ -150,78 +150,32 @@ Binary: `pipeline/02_warc_parse/` (5 files: main.go, warc.go, parser.go, process
---
## Phase 3: Icon Download (Stage 3)
## Phase 3: Icon Download (Stage 3) [COMPLETED]
### Step 3.1: Icon Downloader Go Program
### Steps 3.1-3.3 [COMPLETED]
```
pipeline/03_icon_download/
├── main.go # Entry point, CLI flags, worker pool
├── downloader.go # HTTP fetch with timeouts, size limits
├── decoder.go # Image validation + dimension extraction
├── s3.go # Upload to everytab-icons bucket
└── db.go # Claim work, update results
```
Binary: `pipeline/03_icon_download/` (6 files: main.go, download.go, image.go, s3.go, db.go, log.go)
CLI flags:
- `--db` connection string
- `--s3-bucket` (default `everytab-icons`)
- `--concurrency` (default 1000, tunable)
- `--batch-size` (default 500)
- `--timeout` (default 10s)
- `--max-size` (default 512KB)
- `--dry-run` (fetch and validate but don't upload to S3 or update DB)
- `--limit` (process at most N icons)
**Architecture:**
- Channel-based work distribution: producer goroutine claims batches, N worker goroutines consume from buffered channel (no worker starvation)
- Shared `http.Transport` for connection pooling / TLS session reuse
- Content-addressed S3 storage (SHA-256 hash as key, dedup via HeadObject before upload)
- Magic byte validation (PNG, GIF, JPEG, ICO, BMP, WebP, SVG)
- ICO directory parsing for dimensions (picks largest ≤64x64)
- Filters to eligible icons only: `favicon_ico` rows, plus `link_rel` rows with no declared size or a declared size ≤64x64
- md5(id) shuffle in claim query to spread requests across hosts
- Panic recovery per worker, DB errors tracked and logged
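
The producer/worker layout described above can be sketched roughly as follows. This is a minimal illustration, not the code in `main.go`: the `job` type, `runPool`, and the `claim`/`handle` callbacks are hypothetical names, and recovering per job (rather than once per worker) is an assumption made so that a worker keeps consuming after a panic.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// job is a hypothetical stand-in for one claimed icon row.
type job struct {
	ID  int64
	URL string
}

// runPool starts one producer and n workers. claim returns the next batch,
// or an empty slice once no work remains; handle processes a single job.
// It returns how many jobs finished and how many panicked.
func runPool(n, buffer int, claim func() []job, handle func(job)) (done, panics int64) {
	jobs := make(chan job, buffer)

	// Producer: keeps the buffered channel filled so workers do not idle
	// while the next batch is being claimed.
	go func() {
		defer close(jobs)
		for {
			batch := claim()
			if len(batch) == 0 {
				return
			}
			for _, j := range batch {
				jobs <- j
			}
		}
	}()

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				func() {
					// Recover here so one bad icon cannot kill the worker.
					defer func() {
						if r := recover(); r != nil {
							atomic.AddInt64(&panics, 1)
						}
					}()
					handle(j)
					atomic.AddInt64(&done, 1)
				}()
			}
		}()
	}
	wg.Wait()
	return done, panics
}

func main() {
	batches := [][]job{{{1, "a"}, {2, "b"}}, {{3, "c"}}}
	next := 0
	claim := func() []job { // called only from the producer goroutine
		if next >= len(batches) {
			return nil
		}
		b := batches[next]
		next++
		return b
	}
	done, panics := runPool(4, 8, claim, func(j job) {
		if j.ID == 2 {
			panic("simulated decoder crash")
		}
	})
	fmt.Println("done:", done, "panics:", panics)
}
```

In the real program `claim` would presumably wrap the batch-claim query and `handle` the fetch, validate, and upload steps; the channel buffer size is a tuning knob, not a value taken from the plan.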
Dependencies:
- `github.com/jackc/pgx/v5` — Postgres
- `github.com/aws/aws-sdk-go-v2` — S3 uploads
- `github.com/schollz/progressbar/v3` — Progress bar
- Standard library `image` + `image/png`, `image/gif`, `image/jpeg` for decoding dimensions
- `golang.org/x/image/webp` — WebP decoding
- ICO parsing: write a minimal decoder (ICO format is simple — 6-byte header + directory entries pointing to BMP/PNG data) or find a maintained library at implementation time
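
For reference, a minimal ICO directory reader along the lines described above might look like this. It is a sketch under assumptions, not the contents of `image.go`: `icoEntry`, `parseICODir`, and `pickLargest` are hypothetical names, and returning "not found" when every entry exceeds the limit is an assumed policy. The layout it relies on is the standard one: a 6-byte little-endian header (reserved, type, count) followed by 16-byte directory entries, where a width or height byte of 0 means 256.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// icoEntry holds the fields of one directory entry that matter here.
type icoEntry struct {
	Width, Height int
	Size, Offset  uint32
}

// parseICODir reads the header and directory of an ICO file.
func parseICODir(data []byte) ([]icoEntry, error) {
	le := binary.LittleEndian
	if len(data) < 6 || le.Uint16(data[0:2]) != 0 || le.Uint16(data[2:4]) != 1 {
		return nil, errors.New("not an ICO file")
	}
	count := int(le.Uint16(data[4:6]))
	if count == 0 || len(data) < 6+16*count {
		return nil, errors.New("empty or truncated ICO directory")
	}
	entries := make([]icoEntry, 0, count)
	for i := 0; i < count; i++ {
		e := data[6+16*i : 6+16*(i+1)]
		w, h := int(e[0]), int(e[1])
		if w == 0 { // a zero byte encodes 256 pixels
			w = 256
		}
		if h == 0 {
			h = 256
		}
		entries = append(entries, icoEntry{
			Width:  w,
			Height: h,
			Size:   le.Uint32(e[8:12]),
			Offset: le.Uint32(e[12:16]),
		})
	}
	return entries, nil
}

// pickLargest returns the entry with the largest area whose width and
// height are both at most max. ok is false if no entry qualifies.
func pickLargest(entries []icoEntry, max int) (best icoEntry, ok bool) {
	for _, e := range entries {
		if e.Width > max || e.Height > max {
			continue
		}
		if !ok || e.Width*e.Height > best.Width*best.Height {
			best, ok = e, true
		}
	}
	return best, ok
}

func main() {
	// Header plus two entries: 16x16 and 256x256 (encoded as 0x0).
	data := []byte{0, 0, 1, 0, 2, 0,
		16, 16, 0, 0, 1, 0, 32, 0, 1, 0, 0, 0, 38, 0, 0, 0,
		0, 0, 0, 0, 1, 0, 32, 0, 1, 0, 0, 0, 39, 0, 0, 0}
	entries, err := parseICODir(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	best, ok := pickLargest(entries, 64)
	fmt.Println(best.Width, best.Height, ok)
}
```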
**CLI:** `./icon_download --db URL [--s3-bucket NAME] [--concurrency N] [--batch-size N] [--timeout D] [--max-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]`
### Step 3.2: Work Claiming + Download Logic
Implement:
1. Claim batch with randomized order (md5 shuffle, FOR UPDATE SKIP LOCKED)
2. For each icon URL:
- HTTP GET with timeouts (5s dial, 10s total)
- Read up to max-size bytes, abort if exceeded
- Validate magic bytes (PNG: `\x89PNG`, GIF: `GIF8`, ICO: `\x00\x00\x01\x00`, etc.)
- Determine actual content type from magic bytes (don't trust HTTP Content-Type)
- Decode dimensions:
- PNG/GIF/JPEG/WebP/BMP: read image header (Go `image.DecodeConfig`)
- ICO: parse directory entries, find largest at standard size ≤64x64
- SVG: set width=NULL, height=NULL
- Compute SHA-256 of full content
- Check if S3 key exists (HEAD request); if yes, skip upload (dedup)
- Upload to S3 if new
3. Update icons row with results (or error)
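
The read, validate, and hash steps in the list above can be sketched as follows. Treat it as an illustration under assumptions rather than the actual `download.go`/`image.go`: `readLimited`, `sniffType`, and `contentKey` are hypothetical names, the SVG check is a heuristic (SVG has no magic bytes), and using the bare hex digest as the S3 key is an assumption about the key layout.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
)

var errTooLarge = errors.New("body exceeds max size")

// readLimited reads at most max bytes and fails if the body is longer.
// Reading max+1 bytes is what lets it tell "exactly max" from "too large".
func readLimited(r io.Reader, max int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, max+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > max {
		return nil, errTooLarge
	}
	return data, nil
}

// sniffType names the format from the leading bytes, ignoring any HTTP
// Content-Type. It returns "" when nothing matches (e.g. an HTML error page).
func sniffType(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG")):
		return "png"
	case bytes.HasPrefix(b, []byte("GIF8")):
		return "gif"
	case bytes.HasPrefix(b, []byte("\xff\xd8\xff")):
		return "jpeg"
	case bytes.HasPrefix(b, []byte("\x00\x00\x01\x00")):
		return "ico"
	case bytes.HasPrefix(b, []byte("BM")):
		return "bmp"
	case len(b) >= 12 && bytes.HasPrefix(b, []byte("RIFF")) && bytes.Equal(b[8:12], []byte("WEBP")):
		return "webp"
	}
	// Heuristic (assumption): accept text that starts with <svg, or with an
	// XML declaration followed by <svg within the first 512 bytes.
	head := b
	if len(head) > 512 {
		head = head[:512]
	}
	head = bytes.ToLower(bytes.TrimSpace(head))
	if bytes.HasPrefix(head, []byte("<svg")) ||
		(bytes.HasPrefix(head, []byte("<?xml")) && bytes.Contains(head, []byte("<svg"))) {
		return "svg"
	}
	return ""
}

// contentKey returns the hex SHA-256 of the full content.
func contentKey(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	bodies := [][]byte{
		[]byte("GIF89a\x01\x00\x01\x00"),
		[]byte("<!DOCTYPE html><html><body>404 Not Found</body></html>"),
	}
	for _, body := range bodies {
		data, err := readLimited(bytes.NewReader(body), 512*1024)
		if err != nil {
			fmt.Println("read error:", err)
			continue
		}
		fmt.Printf("type=%q key=%s\n", sniffType(data), contentKey(data))
	}
}
```

Because the key depends only on the bytes, two hosts serving the same default favicon map to the same object, which is what makes the HEAD-before-upload dedup check possible.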
**Dry-run test:** `--limit 200 --dry-run` — prints what it would do for 200 icons. Check URLs, detected types, dimensions.
**Done when:** Can download, validate, and upload icons for a small batch.
### Step 3.3: Full 100K Icon Run
Run against all icons in the database (likely 150K-300K icon rows for 100K hosts).
Monitor:
- icons/sec throughput
- Error breakdown (DNS failures, timeouts, HTTP errors, invalid images)
- S3 dedup hit rate
- Memory usage (adjust concurrency if needed)
**Validation:**
- `SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;` — expect mostly completed, some failed
- `SELECT error, COUNT(*) FROM icons WHERE scan_state = 'failed' GROUP BY error ORDER BY count DESC LIMIT 20;` — understand failure modes
- `aws s3 ls s3://everytab-icons/ | wc -l` — confirm icons in S3
- Spot-check: download a few icons from S3, open them, verify they're valid images
**Stats:** `stats/03_icon_download.json`
**Done when:** Icon download complete for 100K dev set, error rate understood, S3 populated.
**Result (100K hosts, ~224K eligible icons):**
- Duration: 10m36s (351 icons/sec)
- Completed: 156,214 (70%)
- Failed: 67,459 (30% — mostly HTTP 404s from stale crawl data)
- Dedup hits: 55,771 (25% — shared Wix/WordPress/hosted platform favicons)
- Downloaded: 1.9GB
- DNS errors: 1,668 | Timeouts: 2,129 | HTTP errors: 47,565 | Invalid: 11,803 | Too large: 777
- DB errors: 0 | Panics: 0
---
@@ -650,6 +604,22 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- Increasing concurrency from 100 to 500 didn't improve throughput (~300 hosts/sec either way). The bottleneck is likely Postgres write latency or S3 per-connection bandwidth, not parallelism. Could investigate batch inserts for the full run.
- Progress bars and per-item log lines don't mix well in terminals. Pick one or write progress to a separate channel (file, stderr).
### Phase 3 — Completed 2026-05-18
**Changes from original plan:**
- Filtered eligible icons before downloading: skipped `link_rel` icons with a declared size >64x64 (apple-touch-icon bloat), which reduced the download count from ~302K to ~224K.
- Channel-based worker pool instead of semaphore pattern — producer goroutine feeds work channel, N workers consume. No starvation between batch claims.
- Shared http.Transport for connection pooling (marginal benefit since hosts are unique, but reduces GC pressure).
- No progress bar — same approach as Phase 2 (log lines + summary).
- User-Agent set to `EveryTabBot/1.0` with link to `everytab.site/bot` for bot identification.
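
As a rough illustration of the shared transport and bot identification mentioned in this list, the client setup could look something like the sketch below. The exact User-Agent string format, the idle-connection limits, and the helper names `newClient` and `newIconRequest` are assumptions; the 5s dial and 10s total timeouts come from the original Step 3.2.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// userAgent format is assumed; the plan only states the product token
// and that it links to everytab.site/bot.
const userAgent = "EveryTabBot/1.0 (+https://everytab.site/bot)"

// newClient builds one client whose Transport is shared by all workers,
// so connections and TLS sessions can be reused where hosts repeat.
func newClient(total time.Duration) *http.Client {
	tr := &http.Transport{
		DialContext:         (&net.Dialer{Timeout: 5 * time.Second}).DialContext,
		TLSHandshakeTimeout: 5 * time.Second,
		MaxIdleConns:        1000, // assumed limit, tune with concurrency
		MaxIdleConnsPerHost: 2,
		IdleConnTimeout:     30 * time.Second,
	}
	return &http.Client{Transport: tr, Timeout: total}
}

// newIconRequest builds a GET request that identifies the bot.
func newIconRequest(url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	return req, nil
}

func main() {
	client := newClient(10 * time.Second)
	req, err := newIconRequest("https://example.com/favicon.ico")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// client.Do(req) would perform the fetch; it is omitted here so the
	// sketch runs without network access.
	fmt.Println(req.Method, req.URL, req.Header.Get("User-Agent"), client.Timeout)
}
```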
**Lessons learned:**
- 70% icon download success rate is expected — most failures are 404s from domains/pages that changed since the crawl. This is acceptable loss.
- 25% dedup rate — many hosted platforms (Wix, WordPress.com, Squarespace) serve identical default favicons. Content-addressed S3 storage handles this efficiently.
- `data.commoncrawl.org` rate-limits HTTPS but S3 does not — same pattern as WARC parsing. Use S3 API for all CC access.
- Favicon download is I/O bound (network latency to diverse hosts worldwide). Concurrency helps up to a point, after which the long tail of slow or dead servers dominates. The run reached 351 icons/sec at a concurrency of 200.
- Invalid image detection (magic bytes) catches ~5% of "successful" downloads that are actually HTML error pages served at `/favicon.ico`.
---
## Future Improvements
@@ -659,3 +629,5 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- **WARC parser: investigate throughput ceiling** — 300 hosts/sec at both 100 and 500 concurrency suggests a bottleneck. Profile to determine if it's S3 response latency, Postgres writes, or something else. For the full 30M run this determines wall-clock time (~28 hours at current rate).
- **CC-Index query: c5.2xlarge for full run** — 8GB is tight with 6.4GB usage + swap. 16GB instance for the 30M-host full run.
- **Encoding: investigate remaining garbled titles** — Some titles still show `<60>` in output (e.g., `BERGSTRANDS BAGERI <20>...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
- **Icon download: retry transient failures** — DNS and timeout failures could benefit from a single retry. Would recover a small percentage of icons.
- **Icon download: download large link_rel icons** — Currently skipping declared sizes >64x64. Re-run with broader filter for future high-res projects.