added icon downloader
parent
8b5693b5c6
commit
5a2e37ae06
10 changed files with 829 additions and 68 deletions
106
PLAN.md
@@ -150,78 +150,32 @@ Binary: `pipeline/02_warc_parse/` (5 files: main.go, warc.go, parser.go, process
---

## Phase 3: Icon Download (Stage 3)
## Phase 3: Icon Download (Stage 3) [COMPLETED]

### Step 3.1: Icon Downloader Go Program
### Steps 3.1-3.3 [COMPLETED]

```
pipeline/03_icon_download/
├── main.go # Entry point, CLI flags, worker pool
├── downloader.go # HTTP fetch with timeouts, size limits
├── decoder.go # Image validation + dimension extraction
├── s3.go # Upload to everytab-icons bucket
└── db.go # Claim work, update results
```
Binary: `pipeline/03_icon_download/` (6 files: main.go, download.go, image.go, s3.go, db.go, log.go)

CLI flags:
- `--db` connection string
- `--s3-bucket` (default `everytab-icons`)
- `--concurrency` (default 1000, tunable)
- `--batch-size` (default 500)
- `--timeout` (default 10s)
- `--max-size` (default 512KB)
- `--dry-run` (fetch and validate but don't upload to S3 or update DB)
- `--limit` (process at most N icons)
**Architecture:**
- Channel-based work distribution: producer goroutine claims batches, N worker goroutines consume from buffered channel (no worker starvation)
- Shared `http.Transport` for connection pooling / TLS session reuse
- Content-addressed S3 storage (SHA-256 hash as key, dedup via HeadObject before upload)
- Magic byte validation (PNG, GIF, JPEG, ICO, BMP, WebP, SVG)
- ICO directory parsing for dimensions (picks largest ≤64x64)
- Filters to eligible icons only: `favicon_ico` + link_rel with no declared size or ≤64x64
- md5(id) shuffle in claim query to spread requests across hosts
- Panic recovery per worker, DB errors tracked and logged
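
The channel-based distribution and per-worker panic recovery listed above could look roughly like the following. This is a minimal sketch, not the code in `main.go`: the `job` type, `runPool`, and the `claim`/`handle` callbacks are hypothetical stand-ins for the real DB claim and download logic.

```go
package main

import (
	"fmt"
	"sync"
)

// job is a hypothetical unit of work: one icon row to fetch.
type job struct {
	ID  int64
	URL string
}

// runPool starts one producer and n workers joined by a buffered channel.
// claim returns the next batch (an empty batch means no work is left) and
// handle processes one job. Both are placeholders for the real DB/HTTP code.
func runPool(n, buffer int, claim func() []job, handle func(job)) (processed, panics int) {
	work := make(chan job, buffer)
	var mu sync.Mutex
	var wg sync.WaitGroup

	// Producer: keeps claiming batches so workers do not idle at batch boundaries.
	go func() {
		defer close(work)
		for {
			batch := claim()
			if len(batch) == 0 {
				return
			}
			for _, j := range batch {
				work <- j
			}
		}
	}()

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range work {
				func() {
					// Recover per job so one bad icon cannot kill the worker.
					defer func() {
						if r := recover(); r != nil {
							mu.Lock()
							panics++
							mu.Unlock()
						}
					}()
					handle(j)
					mu.Lock()
					processed++
					mu.Unlock()
				}()
			}
		}()
	}
	wg.Wait()
	return processed, panics
}

func main() {
	batches := [][]job{
		{{1, "https://a.example/favicon.ico"}, {2, "https://b.example/favicon.ico"}},
		{{3, "https://c.example/favicon.ico"}},
	}
	next := 0
	claim := func() []job {
		if next >= len(batches) {
			return nil
		}
		b := batches[next]
		next++
		return b
	}
	handle := func(j job) {
		if j.ID == 2 {
			panic("simulated decoder bug")
		}
	}
	done, panics := runPool(4, 8, claim, handle)
	fmt.Printf("processed=%d panics=%d\n", done, panics)
}
```

Because the producer runs independently of the workers, a slow claim query only stalls workers once the channel buffer has drained, which is presumably what "no worker starvation" refers to.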

Dependencies:
- `github.com/jackc/pgx/v5` — Postgres
- `github.com/aws/aws-sdk-go-v2` — S3 uploads
- `github.com/schollz/progressbar/v3` — Progress bar
- Standard library `image` + `image/png`, `image/gif`, `image/jpeg` for decoding dimensions
- `golang.org/x/image/webp` — WebP decoding
- ICO parsing: write a minimal decoder (ICO format is simple — 6-byte header + directory entries pointing to BMP/PNG data) or find a maintained library at implementation time
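
As an illustration of the "minimal decoder" option, the sketch below reads the 6-byte ICO header and the 16-byte directory entries, then picks the largest entry that fits within 64x64. The names `icoEntry`, `parseICODir`, and `pickLargest` are hypothetical, and the sketch assumes "largest" means largest pixel area; it does not restrict entries to standard sizes.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// icoEntry holds the fields of one ICO directory entry used here.
type icoEntry struct {
	Width, Height int
	Size, Offset  uint32 // byte length and file offset of the BMP/PNG payload
}

// parseICODir parses the header (reserved=0, type=1, count) and the directory.
func parseICODir(data []byte) ([]icoEntry, error) {
	if len(data) < 6 {
		return nil, errors.New("ico: short header")
	}
	le := binary.LittleEndian
	if le.Uint16(data[0:2]) != 0 || le.Uint16(data[2:4]) != 1 {
		return nil, errors.New("ico: bad magic")
	}
	count := int(le.Uint16(data[4:6]))
	if count == 0 || len(data) < 6+16*count {
		return nil, errors.New("ico: truncated directory")
	}
	entries := make([]icoEntry, 0, count)
	for i := 0; i < count; i++ {
		e := data[6+16*i : 6+16*(i+1)]
		w, h := int(e[0]), int(e[1])
		if w == 0 { // a zero byte encodes 256 pixels
			w = 256
		}
		if h == 0 {
			h = 256
		}
		entries = append(entries, icoEntry{w, h, le.Uint32(e[8:12]), le.Uint32(e[12:16])})
	}
	return entries, nil
}

// pickLargest returns the largest entry whose width and height are both <= max.
func pickLargest(entries []icoEntry, max int) (icoEntry, bool) {
	var best icoEntry
	found := false
	for _, e := range entries {
		if e.Width > max || e.Height > max {
			continue
		}
		if !found || e.Width*e.Height > best.Width*best.Height {
			best, found = e, true
		}
	}
	return best, found
}

func main() {
	// Two-entry directory: 16x16 and 256x256 (payload bytes omitted).
	data := []byte{0, 0, 1, 0, 2, 0,
		16, 16, 0, 0, 1, 0, 32, 0, 0x68, 4, 0, 0, 38, 0, 0, 0,
		0, 0, 0, 0, 1, 0, 32, 0, 0, 0, 1, 0, 0x8e, 4, 0, 0}
	entries, err := parseICODir(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	best, ok := pickLargest(entries, 64)
	fmt.Println(len(entries), best.Width, best.Height, ok)
}
```

Only the directory is read, so dimensions are available without decoding any of the embedded BMP or PNG payloads.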
**CLI:** `./icon_download --db URL [--s3-bucket NAME] [--concurrency N] [--batch-size N] [--timeout D] [--max-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]`

### Step 3.2: Work Claiming + Download Logic

Implement:
1. Claim batch with randomized order (md5 shuffle, FOR UPDATE SKIP LOCKED)
2. For each icon URL:
   - HTTP GET with timeouts (5s dial, 10s total)
   - Read up to max-size bytes, abort if exceeded
   - Validate magic bytes (PNG: `\x89PNG`, GIF: `GIF8`, ICO: `\x00\x00\x01\x00`, etc.)
   - Determine actual content type from magic bytes (don't trust HTTP Content-Type)
   - Decode dimensions:
     - PNG/GIF/JPEG/WebP/BMP: read image header (Go `image.DecodeConfig`)
     - ICO: parse directory entries, find largest at standard size ≤64x64
     - SVG: set width=NULL, height=NULL
   - Compute SHA-256 of full content
   - Check if S3 key exists (HEAD request); if yes, skip upload (dedup)
   - Upload to S3 if new
3. Update icons row with results (or error)
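
The read, validate, and hash steps in item 2 can be sketched as follows. This is an illustration under assumptions rather than the actual `download.go`/`image.go`: `readLimited`, `sniffType`, and `inspect` are hypothetical names, using the hex SHA-256 digest as the S3 key is a guess at the key layout, and SVG detection (which needs a text-based check) is left out.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
)

var errTooLarge = errors.New("body exceeds max size")

// readLimited reads at most max bytes and returns errTooLarge if more remain.
func readLimited(r io.Reader, max int64) ([]byte, error) {
	// Read one extra byte so an oversized body can be detected.
	data, err := io.ReadAll(io.LimitReader(r, max+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > max {
		return nil, errTooLarge
	}
	return data, nil
}

// sniffType maps leading magic bytes to a content type, ignoring HTTP headers.
func sniffType(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG")):
		return "image/png"
	case bytes.HasPrefix(b, []byte("GIF8")):
		return "image/gif"
	case bytes.HasPrefix(b, []byte{0xFF, 0xD8, 0xFF}):
		return "image/jpeg"
	case bytes.HasPrefix(b, []byte{0, 0, 1, 0}):
		return "image/x-icon"
	case bytes.HasPrefix(b, []byte("BM")):
		return "image/bmp"
	case len(b) >= 12 && string(b[0:4]) == "RIFF" && string(b[8:12]) == "WEBP":
		return "image/webp"
	}
	return "" // unrecognized, e.g. an HTML error page served at /favicon.ico
}

// inspect runs the size check, magic-byte check, and SHA-256 on an in-memory body.
func inspect(body []byte, max int64) (ctype, key string, err error) {
	data, err := readLimited(bytes.NewReader(body), max)
	if err != nil {
		return "", "", err
	}
	ctype = sniffType(data)
	if ctype == "" {
		return "", "", errors.New("not a recognized image")
	}
	sum := sha256.Sum256(data)
	return ctype, hex.EncodeToString(sum[:]), nil
}

func main() {
	ctype, key, err := inspect([]byte("\x89PNG\r\n\x1a\n...pixels..."), 512*1024)
	fmt.Println(ctype, len(key), err)

	_, _, err = inspect([]byte("<html>404 Not Found</html>"), 512*1024)
	fmt.Println(err)
}
```

In the real downloader the reader would be the HTTP response body; wrapping it in `io.LimitReader` caps memory per request regardless of what the server sends.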

**Dry-run test:** `--limit 200 --dry-run` — prints what it would do for 200 icons. Check URLs, detected types, dimensions.

**Done when:** Can download, validate, and upload icons for a small batch.

### Step 3.3: Full 100K Icon Run

Run against all icons in the database (likely 150K-300K icon rows for 100K hosts).

Monitor:
- icons/sec throughput
- Error breakdown (DNS failures, timeouts, HTTP errors, invalid images)
- S3 dedup hit rate
- Memory usage (adjust concurrency if needed)

**Validation:**
- `SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;` — expect mostly completed, some failed
- `SELECT error, COUNT(*) FROM icons WHERE scan_state = 'failed' GROUP BY error ORDER BY count DESC LIMIT 20;` — understand failure modes
- `aws s3 ls s3://everytab-icons/ | wc -l` — confirm icons in S3
- Spot-check: download a few icons from S3, open them, verify they're valid images

**Stats:** `stats/03_icon_download.json`

**Done when:** Icon download complete for 100K dev set, error rate understood, S3 populated.
**Result (100K hosts, ~224K eligible icons):**
- Duration: 10m36s (351 icons/sec)
- Completed: 156,214 (70%)
- Failed: 67,459 (30% — mostly HTTP 404s from stale crawl data)
- Dedup hits: 55,771 (25% — shared Wix/WordPress/hosted platform favicons)
- Downloaded: 1.9GB
- DNS errors: 1,668 | Timeouts: 2,129 | HTTP errors: 47,565 | Invalid: 11,803 | Too large: 777
- DB errors: 0 | Panics: 0
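
The headline figures can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers from the summary above; `pct` and `uncategorized` are hypothetical helper names. It also shows that the five listed error categories add up to 63,942, so roughly 3,517 failures presumably fall under causes the summary does not break out.

```go
package main

import "fmt"

// Figures copied from the result summary.
const (
	completed   = 156214
	failed      = 67459
	dedupHits   = 55771
	durationSec = 10*60 + 36 // 10m36s
)

// pct returns part as a percentage of total.
func pct(part, total int) float64 {
	return 100 * float64(part) / float64(total)
}

// uncategorized returns the failures not covered by the listed categories.
func uncategorized(failed int, categories ...int) int {
	for _, c := range categories {
		failed -= c
	}
	return failed
}

func main() {
	total := completed + failed
	fmt.Printf("total=%d throughput=%.1f icons/sec\n", total, float64(total)/durationSec)
	fmt.Printf("success=%.1f%% dedup=%.1f%%\n", pct(completed, total), pct(dedupHits, total))
	// DNS, timeouts, HTTP errors, invalid images, too large.
	fmt.Println("uncategorized failures:", uncategorized(failed, 1668, 2129, 47565, 11803, 777))
}
```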

---

@@ -650,6 +604,22 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- Increasing concurrency from 100 to 500 didn't improve throughput (~300 hosts/sec either way). The bottleneck is likely Postgres write latency or S3 per-connection bandwidth, not parallelism. Could investigate batch inserts for the full run.
- Progress bars and per-item log lines don't mix well in terminals. Pick one or write progress to a separate channel (file, stderr).

### Phase 3 — Completed 2026-05-18

**Changes from original plan:**
- Filtered eligible icons before downloading: skip link_rel icons with declared size >64x64 (apple-touch-icon bloat). Reduced download count from ~302K to ~224K.
- Channel-based worker pool instead of semaphore pattern — producer goroutine feeds work channel, N workers consume. No starvation between batch claims.
- Shared http.Transport for connection pooling (marginal benefit since hosts are unique, but reduces GC pressure).
- No progress bar — same approach as Phase 2 (log lines + summary).
- User-Agent set to `EveryTabBot/1.0` with link to `everytab.site/bot` for bot identification.
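
A shared transport with the bot User-Agent might be set up along these lines. This is a sketch under assumptions: the exact User-Agent string format, the idle-connection settings, and the names `newClient` and `newRequest` are illustrative, while the 5s dial and 10s total timeouts come from the original plan.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// userAgent format is an assumption; the plan only names the product token and URL.
const userAgent = "EveryTabBot/1.0 (+https://everytab.site/bot)"

// newClient builds one http.Client whose Transport is shared by all workers.
func newClient(total time.Duration) *http.Client {
	tr := &http.Transport{
		DialContext:         (&net.Dialer{Timeout: 5 * time.Second}).DialContext,
		TLSHandshakeTimeout: 5 * time.Second,
		MaxIdleConns:        1000, // illustrative value
		MaxIdleConnsPerHost: 2,    // hosts are mostly unique, so keep this small
		IdleConnTimeout:     30 * time.Second,
	}
	return &http.Client{Transport: tr, Timeout: total}
}

// client is created once and reused by every worker goroutine.
var client = newClient(10 * time.Second)

// newRequest builds a GET request that identifies the bot.
func newRequest(url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	return req, nil
}

func main() {
	req, err := newRequest("https://example.com/favicon.ico")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Header.Get("User-Agent"), client.Timeout)
}
```

`http.Client` and `http.Transport` are safe for concurrent use, so a single instance can serve all workers instead of allocating one per request.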

**Lessons learned:**
- 70% icon download success rate is expected — most failures are 404s from domains/pages that changed since the crawl. This is acceptable loss.
- 25% dedup rate — many hosted platforms (Wix, WordPress.com, Squarespace) serve identical default favicons. Content-addressed S3 storage handles this efficiently.
- `data.commoncrawl.org` rate-limits HTTPS but S3 does not — same pattern as WARC parsing. Use S3 API for all CC access.
- Favicon download is I/O bound (network latency to diverse hosts worldwide). Concurrency helps up to a point, then the long tail of slow/dead servers dominates. 351 icons/sec at 200 concurrency.
- Invalid image detection (magic bytes) catches ~5% of "successful" downloads that are actually HTML error pages served at `/favicon.ico`.

---

## Future Improvements

@@ -659,3 +629,5 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- **WARC parser: investigate throughput ceiling** — 300 hosts/sec at both 100 and 500 concurrency suggests a bottleneck. Profile to determine if it's S3 response latency, Postgres writes, or something else. For the full 30M run this determines wall-clock time (~28 hours at current rate).
- **CC-Index query: c5.2xlarge for full run** — 8GB is tight at 6.4GB usage plus swap. Use a 16GB instance for the 30M-host full run.
- **Encoding: investigate remaining garbled titles** — Some titles still show `<60>` in output (e.g., `BERGSTRANDS BAGERI <20>...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
- **Icon download: retry transient failures** — DNS and timeout failures could benefit from a single retry. Would recover a small percentage of icons.
- **Icon download: download large link_rel icons** — Currently skipping declared sizes >64x64. Re-run with a broader filter for future high-res projects.
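
The single-retry idea for transient failures could be sketched as below. Nothing here exists in the current downloader: `isTransient`, `fetchWithRetry`, and the two sample errors are hypothetical, and classifying every `*net.DNSError` as retryable is an assumption that may be too generous for hosts that no longer exist.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// Sample errors used for demonstration only.
var (
	errSimulatedDNS error = &net.DNSError{Err: "server misbehaving", Name: "flaky.example", IsTemporary: true}
	errPermanent          = errors.New("HTTP 404")
)

// isTransient reports whether err looks like a DNS failure or a timeout.
func isTransient(err error) bool {
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		return true
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

// fetchWithRetry calls fetch once and, on a transient error, exactly once more.
// It returns the data, the number of attempts made, and the final error.
func fetchWithRetry(fetch func() ([]byte, error)) ([]byte, int, error) {
	data, err := fetch()
	if err == nil || !isTransient(err) {
		return data, 1, err
	}
	data, err = fetch()
	return data, 2, err
}

func main() {
	calls := 0
	data, attempts, err := fetchWithRetry(func() ([]byte, error) {
		calls++
		if calls == 1 {
			return nil, errSimulatedDNS
		}
		return []byte("icon bytes"), nil
	})
	fmt.Println(len(data), attempts, err)

	_, attempts, err = fetchWithRetry(func() ([]byte, error) { return nil, errPermanent })
	fmt.Println(attempts, err)
}
```

HTTP errors such as 404 are deliberately not retried, since the results above suggest they mostly reflect stale crawl data rather than temporary conditions.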