` and `<link rel="icon">` tags.

---

## Phase 2: WARC Parsing (Stage 2) [COMPLETED]

### Steps 2.1-2.3 [COMPLETED]

Binary: `pipeline/02_warc_parse/` (6 files: main.go, warc.go, parser.go, process.go, db.go, log.go)

**Architecture:**
- Fetches WARC records via AWS SDK S3 byte-range GetObject (using EC2 instance profile credentials)
- Parses WARC records with `github.com/nlnwa/gowarc/v3`
- Parses HTML with `golang.org/x/net/html` tokenizer (lenient, stops at `<body>`)
- Detects charset via `golang.org/x/net/html/charset` and converts to UTF-8
- Sanitizes titles with `strings.ToValidUTF8` as final safety net
- Concurrent goroutine pool with configurable concurrency
- Per-host log lines to stdout + optional log file
- Panic recovery per goroutine (logs PANIC, doesn't mark row as parsed)
- DB errors tracked and logged with `DB_ERROR:` prefix

**CLI:** `./warc_parse --db URL [--concurrency N] [--batch-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]`

**Result (100K hosts, concurrency 500):**
- Duration: 5m31s (~300 hosts/sec)
- Titles found: 93,384 (93%)
- Icons found: 201,780 (~2 per host)
- Iframe blocked: 17,855 (18%)
- Fetch errors: 3
- DB errors: 0
- Panics: 0

---

## Phase 3: Icon Download (Stage 3) [COMPLETED]

### Steps 3.1-3.3 [COMPLETED]

Binary: `pipeline/03_icon_download/` (6 files: main.go, download.go, image.go, s3.go, db.go, log.go)

**Architecture:**
- Channel-based work distribution: producer goroutine claims batches, N worker goroutines consume from buffered channel (no worker starvation)
- Shared `http.Transport` for connection pooling / TLS session reuse
- Content-addressed S3 storage (SHA-256 hash as key, dedup via HeadObject before upload)
- Magic byte validation (PNG, GIF, JPEG, ICO, BMP, WebP, SVG)
- ICO directory parsing for dimensions (picks largest ≤64x64)
- Filters to eligible icons only: `favicon_ico` + link_rel with no declared size or ≤64x64
- md5(id) shuffle in claim query to spread requests across hosts
- Panic recovery per
worker, DB errors tracked and logged

**CLI:** `./icon_download --db URL [--s3-bucket NAME] [--concurrency N] [--batch-size N] [--timeout D] [--max-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]`

**Result (100K hosts, ~224K eligible icons):**
- Duration: 10m36s (351 icons/sec)
- Completed: 156,214 (70%)
- Failed: 67,459 (30%, mostly HTTP 404s from stale crawl data)
- Dedup hits: 55,771 (25%, shared Wix/WordPress/hosted platform favicons)
- Downloaded: 1.9GB
- DNS errors: 1,668 | Timeouts: 2,129 | HTTP errors: 47,565 | Invalid: 11,803 | Too large: 777
- DB errors: 0 | Panics: 0

---

## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5)

### Step 4.1: Best Icon Selection SQL

Write `pipeline/04_best_icon/select.sql`:

```sql
UPDATE hosts h
SET best_icon_s3_key = sub.s3_key
FROM (
    SELECT DISTINCT ON (i.host_id)
        i.host_id,
        i.s3_key
    FROM icons i
    WHERE i.scan_state = 'completed'
    ORDER BY
        i.host_id,
        -- Size tier: standard square sizes, other small squares, anything within 64x64, rest
        CASE
            WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
            WHEN i.width = i.height AND i.width <= 64 THEN 1
            WHEN i.width IS NOT NULL AND i.width <= 64 AND i.height <= 64 THEN 2
            ELSE 3
        END,
        COALESCE(i.width, 0) DESC,
        -- Format tier: raster formats first, then WebP, then SVG
        CASE
            WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
            WHEN i.content_type = 'image/webp' THEN 1
            WHEN i.content_type = 'image/svg+xml' THEN 2
            ELSE 3
        END,
        i.file_size ASC
) sub
WHERE h.id = sub.host_id;
```

**Validation:**
- `SELECT COUNT(*) FROM hosts WHERE best_icon_s3_key IS NOT NULL;`: expect 60-80% of hosts
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL AND best_icon_s3_key IS NULL;`: hosts with title but no icon (will still be in bundles)
- Spot-check: for a few hosts, verify the selected icon is reasonable (correct size, valid)

**Stats:** `stats/04_best_icon.json`

**Done when:** best_icon_s3_key populated for hosts that have valid icons.
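
For spot-checking, the ranking in the SQL above can be mirrored in a few lines of Go and run against the icons of a single host. This is only a sketch: the `Icon` struct, `pickBest`, and the convention that a NULL width or height is stored as 0 are assumptions made for illustration, not part of the pipeline code. NULL `file_size` handling is also left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Icon holds the columns the ranking looks at. Width and Height are 0 when
// the database value is NULL (a simplification for this sketch).
type Icon struct {
	S3Key       string
	ContentType string
	Width       int
	Height      int
	FileSize    int
}

// sizeTier mirrors the first CASE expression: standard square sizes first,
// then other squares up to 64px, then anything within 64x64, then the rest.
func sizeTier(ic Icon) int {
	known := ic.Width > 0 && ic.Height > 0
	square := known && ic.Width == ic.Height
	switch {
	case square && (ic.Width == 64 || ic.Width == 48 || ic.Width == 32 || ic.Width == 16):
		return 0
	case square && ic.Width <= 64:
		return 1
	case known && ic.Width <= 64 && ic.Height <= 64:
		return 2
	default:
		return 3
	}
}

// formatTier mirrors the second CASE expression on content_type.
func formatTier(ic Icon) int {
	switch ic.ContentType {
	case "image/png", "image/gif", "image/x-icon", "image/vnd.microsoft.icon":
		return 0
	case "image/webp":
		return 1
	case "image/svg+xml":
		return 2
	default:
		return 3
	}
}

// better reports whether a should be preferred over b.
func better(a, b Icon) bool {
	if ta, tb := sizeTier(a), sizeTier(b); ta != tb {
		return ta < tb
	}
	if a.Width != b.Width {
		return a.Width > b.Width // COALESCE(width, 0) DESC
	}
	if fa, fb := formatTier(a), formatTier(b); fa != fb {
		return fa < fb
	}
	return a.FileSize < b.FileSize
}

// pickBest returns the preferred icon, or false when the slice is empty.
func pickBest(icons []Icon) (Icon, bool) {
	if len(icons) == 0 {
		return Icon{}, false
	}
	sorted := append([]Icon(nil), icons...)
	sort.SliceStable(sorted, func(i, j int) bool { return better(sorted[i], sorted[j]) })
	return sorted[0], true
}

func main() {
	best, _ := pickBest([]Icon{
		{S3Key: "a", ContentType: "image/png", Width: 180, Height: 180, FileSize: 9000},
		{S3Key: "b", ContentType: "image/x-icon", Width: 32, Height: 32, FileSize: 4200},
		{S3Key: "c", ContentType: "image/svg+xml", FileSize: 800},
		{S3Key: "d", ContentType: "image/png", Width: 16, Height: 16, FileSize: 500},
	})
	fmt.Println(best.S3Key) // prints "b": both 32x32 and 16x16 are tier 0, the wider one wins
}
```

Comparing the output of such a helper with `best_icon_s3_key` for a handful of hosts is a cheap way to confirm that the `ORDER BY` does what was intended.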
### Step 4.2: Bundle Generator Go Program

```
pipeline/05_bundle_gen/
├── main.go       # Entry point, CLI flags
├── db.go         # Query hosts + icon keys
├── convert.go    # Icon format conversion → PNG
├── bundle.go     # Chunk + serialize JSON
└── s3.go         # Upload bundles to everytab-site
```

CLI flags:
- `--db` connection string
- `--icons-bucket` (default `everytab-icons`)
- `--site-bucket` (default `everytab-site`)
- `--entries-per-bundle` (tunable, start at 120)
- `--dry-run` (generate bundles to local disk, don't upload)
- `--limit` (only process N hosts, for testing)

### Step 4.3: Icon Conversion Logic

Implement format conversion to PNG:
1. Download icon from S3 by key
2. Detect format from magic bytes
3. Decode:
   - PNG: decode directly
   - ICO: parse container, extract image at recorded width/height, decode BMP or PNG within
   - GIF/JPEG/BMP/WebP: decode to RGBA
   - SVG: rasterize to 32x32 (use a Go SVG library, or shell out to `rsvg-convert` if simpler)
4. Re-encode as PNG (optimized, don't upscale)
5. Base64-encode

**Test:** Convert 50 icons of mixed formats manually, verify output PNGs look correct.

### Step 4.4: Bundle Assembly + Upload

Implement:
1. Query all hosts WHERE html_title IS NOT NULL, randomize (ORDER BY random())
2. For each host: fetch + convert its icon (or set empty string if no icon)
3. Assemble entries into chunks of `ENTRIES_PER_BUNDLE`
4. Serialize each chunk as JSON (`tabs/{n}.json`)
5. Upload to S3 `everytab-site/tabs/`
6. Record total bundle count

**Dry-run:** Generate bundles to local disk, inspect a few:
- Valid JSON
- Icons render in browser (paste a data:image/png;base64,...
URI)
- Entries have host, title, icon, icon_w, icon_h, iframe_ok

**Validation:**
- Bundle files exist in S3
- `aws s3 ls s3://everytab-site/tabs/ | wc -l` matches expected count
- Random bundle can be fetched and parsed as JSON
- Total hosts across all bundles = count of hosts with titles

**Stats:** `stats/05_bundle_gen.json`

**Done when:** All bundles uploaded to S3, JSON is valid, icons render.

---

## Phase 5: Frontend (Stage 6)

Begins after Phase 4 is complete. We use real bundle data from the 100K pipeline run for frontend development.

### Step 5.1: Local Dev Server

Serve the generated bundles from S3 locally for frontend development:

```bash
# Copy a few bundles locally for testing
# (`aws s3 sync` has no option to cap the number of files, so list and copy instead)
mkdir -p ./local-tabs
aws s3 ls s3://everytab-site/tabs/ | head -n 10 | awk '{print $4}' |
  while read -r f; do aws s3 cp "s3://everytab-site/tabs/$f" ./local-tabs/; done

# Serve with any static file server
python3 -m http.server 8000 --directory ./local-tabs
```

**Done when:** Can fetch real bundle JSON from a local dev server.

### Step 5.2: Basic Tab Rendering

Build `frontend/index.html` and `frontend/site.js`:
1. HTML: minimal shell with a container div, inline CSS for tab styling
2. JS: fetch a bundle, render tabs as rows filling the viewport
3. Tab appearance: mimic Firefox tab shape (rounded top corners, slight border)
4. Each tab shows favicon (16x16 or 32x32 img from data URI) + truncated title
5. No-icon tabs show title only

Focus: get the visual density right. How many tabs fit across? How many rows fill the viewport? This determines `ENTRIES_PER_BUNDLE`.

**Done when:** Page renders tabs from a locally served bundle. Visually looks like a page full of browser tabs.

### Step 5.3: Marquee Animation

Add horizontal marquee to each row:
- CSS `@keyframes` animation, translateX
- Each row at slightly different speed and direction (some left, some right)
- Smooth, subtle movement: not distracting, just enough to feel alive
- Rows need extra tabs beyond viewport width to avoid gaps during scroll

**Done when:** Rows scroll smoothly, no visual glitches at edges.
### Step 5.4: Interaction (Click, Iframe, Close)

Implement tab click behavior:
1. If `iframe_ok`: show an overlay with iframe loading the site (`{protocol}://{hostname}`)
2. If `!iframe_ok`: open in new tab (`target="_blank"`, add `rel="noopener"`)
3. Visual indicator on tabs that will open externally (small icon/badge)
4. Close overlay: X button + click-outside + Escape key

**Done when:** Clicking tabs works correctly for both iframe and external cases.

### Step 5.5: Infinite Scroll + Random Bundle Loading

Implement:
1. Seeded PRNG using `Date.now()` as the seed: generates a deterministic sequence of bundle indices
2. On page load: fetch first bundle, render
3. Scroll detection: when user approaches bottom, fetch next random bundle
4. Track loaded bundle IDs in a Set (no duplicates)
5. Append new rows below existing ones
6. Handle edge case: all bundles loaded (unlikely with 50K+ bundles but handle gracefully)

`TOTAL_BUNDLES` is a constant baked into the JS at build time.

**Done when:** Infinite scroll works, new bundles load seamlessly, no duplicate bundles.

### Step 5.6: Frontend Build Script

Write `pipeline/06_frontend/build.sh`:
1. Read total bundle count (from pipeline output or S3)
2. Inject `const TOTAL_BUNDLES = {M};` into site.js
3. Copy index.html + site.js to S3 `everytab-site/`
4. Invalidate CloudFront (if distribution exists)

**Done when:** Build script produces deployable frontend with correct bundle count.

---

## Phase 6: Integration & End-to-End Test (100K)

### Step 6.1: Run Full Pipeline (100K)

Execute all stages in sequence on EC2:
1. Verify hosts table has 100K entries (from Phase 1)
2. Run WARC parser (Phase 2): should complete in minutes
3. Run icon downloader (Phase 3): should complete in 10-30 minutes at 100K scale
4. Run best icon selection (Phase 4.1)
5. Run bundle generator (Phase 4.2-4.4)
6. Run frontend build (Phase 5.6)

**Validation:** Visit the CloudFront URL.
The site should work:
- Tabs render with real favicons and titles
- Clicking works (iframe + external)
- Scrolling loads more tabs
- No JS console errors

### Step 6.2: Tune Parameters

Based on the 100K run:
- **ENTRIES_PER_BUNDLE:** Look at the live site. Does one bundle fill the screen? Too many tabs? Too few? Adjust.
- **Concurrency:** Was the icon download memory-stable? CPU-bound or network-bound? Adjust goroutine pool size.
- **Timeouts:** What was the error distribution? Are timeouts too aggressive? Too lenient?
- **Icon selection:** Do the selected icons look good? Any weird sizes or broken images?

Update CLI flag defaults based on findings.

### Step 6.3: Collect & Review Stats

Merge all `stats/*.json` into a single pipeline report. Review:
- Loss at each stage (domains → parsed → icons downloaded → icons selected → bundled)
- Time per stage
- Error patterns (are certain TLDs failing more? certain icon formats?)
- Storage usage (S3 icons bucket, S3 site bucket)

Identify any pipeline bugs or data quality issues. Fix before scaling up.

**Done when:** End-to-end works at 100K, parameters tuned, stats reviewed, bugs fixed.

---

## Phase 7: Full-Scale Run (30M)

### Step 7.1: Remove Limits, Re-run CC-Index Query

Update the DuckDB query to remove `LIMIT 100000`. Re-run.

Considerations:
- If httpfs takes >1hr, switch to downloading the parquet files first
- May need to increase RDS storage (30M rows with WARC paths ≈ 5-10GB)
- Monitor DuckDB memory usage

**Validation:** `SELECT COUNT(*) FROM hosts;` shows ~30M rows.

### Step 7.2: Run WARC Parser at Scale

Run with full concurrency against 30M hosts. The original estimate was 2-6 hours; at the ~300 hosts/sec measured in the 100K run, 30M hosts would take ~28 hours unless the throughput ceiling is lifted first.

Monitor:
- Throughput (hosts/sec)
- Error rate stability (should plateau, not climb)
- Postgres connection pool health
- Memory usage

### Step 7.3: Run Icon Downloader at Scale

This is the long pole: expected 12-48 hours.
Monitor continuously:
- icons/sec rate
- DNS cache hit rate (check Unbound stats: `unbound-control stats`)
- S3 upload rate
- Error rate by type
- Completion percentage

If too slow (projected >48hrs):
- Consider increasing concurrency (if memory allows)
- Consider spinning up a fleet (more EC2 instances running the same binary)
- Check if DNS is the bottleneck (Unbound stats)
- Check if S3 uploads are the bottleneck (batch or reduce HEAD checks)

### Step 7.4: Best Icon Selection + Bundle Generation

Run at full scale. Expected: 1-2 hours total.

Monitor bundle sizes and verify they're in the expected range with `ENTRIES_PER_BUNDLE` from tuning.

### Step 7.5: Rebuild Frontend + Deploy

Run frontend build with the real bundle count. Invalidate CloudFront.

**Validation:** Visit the live site. Browse around. Check:
- Tab variety (seeing diverse sites, not just one TLD)
- Icon quality (no broken images, reasonable sizes)
- Performance (bundles load quickly, no jank)
- Stats page / stats.json looks correct

**Done when:** Full-scale site is live and working.
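
The time estimates in Steps 7.2 and 7.3 can be sanity-checked against the 100K results with a little arithmetic. The sketch below assumes throughput stays flat at the measured rates (300 hosts/sec, 351 icons/sec) and that the ratio of eligible icons to hosts (~224K per 100K) carries over to 30M hosts; both are assumptions, and `projectDuration` is a hypothetical helper, not part of the pipeline.

```go
package main

import (
	"fmt"
	"time"
)

// projectDuration estimates wall-clock time for `items` units of work
// processed at a steady rate of `perSec` units per second.
func projectDuration(items, perSec float64) time.Duration {
	return time.Duration(items / perSec * float64(time.Second))
}

func main() {
	const hosts = 30_000_000.0

	// Eligible icons per host observed in the 100K run (~224K icons / 100K hosts).
	iconsPerHost := 224_000.0 / 100_000.0

	warc := projectDuration(hosts, 300)               // measured: ~300 hosts/sec
	icons := projectDuration(hosts*iconsPerHost, 351) // measured: 351 icons/sec

	fmt.Printf("WARC parse:    %.1f h\n", warc.Hours())
	fmt.Printf("Icon download: %.1f h\n", icons.Hours())
}
```

Under these assumptions the WARC parse lands at roughly 28 hours and the icon download at roughly 53 hours on a single instance, which is the kind of projection that would trigger the "if too slow" options listed in Step 7.3.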
---

## Phase 8: Backup & Teardown

### Step 8.1: Backup RDS to Homelab

```bash
# On EC2 (fast connection to RDS):
pg_dump -Fc "$DATABASE_URL" > everytab_dump.pgfc

# Transfer to homelab (from EC2 or direct):
scp everytab_dump.pgfc homelab:/backups/everytab/

# On homelab, verify restore (one -c per query so both counts are printed):
pg_restore -d everytab_local everytab_dump.pgfc
psql everytab_local -c "SELECT COUNT(*) FROM hosts;" -c "SELECT COUNT(*) FROM icons;"
```

### Step 8.2: Backup Icons S3 to Homelab

```bash
# From homelab (or EC2 as intermediary):
aws s3 sync s3://everytab-icons/ /backups/everytab/icons/

# Verify file count matches:
ls /backups/everytab/icons/ | wc -l
# Compare with:
aws s3 ls s3://everytab-icons/ | wc -l
```

### Step 8.3: Verify & Teardown

After confirming backups:

```bash
# Verify the live site still works (it only depends on everytab-site + CloudFront)
curl -s https://your-cloudfront-domain.net/ | head

# Teardown scanning infrastructure:
aws rds delete-db-instance --db-instance-identifier everytab --skip-final-snapshot
aws s3 rb s3://everytab-icons --force
aws ec2 terminate-instances --instance-ids i-xxxxx
```

**Done when:** Only `everytab-site` S3 bucket + CloudFront remain running. Monthly cost: ~$2-4.

---

## Development Notes

### Execution Order

Phases are sequential: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8. Frontend (Phase 5) uses real data from the 100K pipeline run. The only thing that can be developed ahead of time is writing Go code locally before EC2 is ready (compile-test locally, run on EC2).

### Progress & Observability

All Go programs have two output modes running simultaneously:

**Per-item log lines** (stdout, above the progress bar):
- WARC parser: `parsed: example.com 200 "Example Domai..."
ok` or `parsed: broken.net 200 "" err:no_title`
- Icon downloader: `icon: https://example.com/favicon.ico 32x32 png 4.2KB ok` or `icon: https://fail.org/favicon.ico err:timeout`
- Bundle generator: `bundle: 0042.json 120 entries 247KB ok`

Each line is a short, fixed-format summary: hostname/URL, key result, and status. Keeps it scannable when running live.

**Log file** (`--log-file path/to/out.log`):
If provided, mirror all per-item log lines to disk. For full-scale runs, consider using the `--log-errors-only` flag to write only error lines to the log file (avoids filling disk with 30M success lines). Without `--log-file`, logs only go to stdout.

**Progress bar** (bottom of terminal, `schollz/progressbar`):
- Items processed / total items
- Processing rate (items/sec)
- ETA
- Error count

On completion, each program prints a summary line and writes its stats JSON (with started_at, finished_at, duration_seconds, and stage-specific counters).

### Testing Strategy

- **Dry-run flags** on all Go programs: print what would happen without mutating DB/S3
- **--limit flags** on all Go programs: process a small subset quickly
- **Spot-checks:** after each stage, manually verify 5-10 random entries
- **Stats files:** compare counts between stages to catch data loss
- **100K dev set:** full pipeline at small scale before committing to a 24hr+ full run

### Common Pitfalls to Watch For

- **DuckDB CC-Index path:** The exact S3 path to parquet files changes per crawl. Check Common Crawl's website for the latest crawl ID and index location.
- **WARC record format:** WARC records have a specific envelope format (WARC/1.0 header, blank line, HTTP response). Don't assume the HTTP response starts at byte 0.
- **Relative icon URLs:** `/favicon.ico` is relative to root, but `favicon.ico` (no leading slash) is relative to the page path. Since we only have root pages (`/`), both resolve the same. But `../icons/fav.png` could be tricky; handle gracefully or skip.
- **ICO files are complex:** The ICO container format can embed BMP (with a modified header) or PNG. Many "ICO" files are actually just PNGs renamed to .ico. Check magic bytes, not file extension.
- **SVG rasterization:** Go doesn't have great native SVG support. Consider shelling out to `rsvg-convert` (part of librsvg), or use a Go library such as `github.com/srwiley/oksvg` with `github.com/srwiley/rasterx`. This can be a follow-up if SVG icons are rare.
- **Postgres connection limits:** RDS caps connections per instance class. The plan assumed max_connections ≈ 80 for db.t3.medium; confirm the real value with `SHOW max_connections;`. With 1000 goroutines, we need connection pooling (pgx pool handles this). Set pool max to ~40 connections.
- **S3 dedup races:** S3 has provided strong read-after-write consistency since December 2020, so a HEAD issued after a completed upload will find the object. Two workers can still both get "not found" for the same hash before either upload finishes. Handle that gracefully: just upload again (idempotent since the key is the content hash).
- **CloudFront caching:** After deploying new bundles, invalidate `/*` or set short TTL during development. For production, use long TTLs (bundles are immutable between crawls).

---

## Progress Log

### Phase 0 (Completed 2026-05-17)

**Changes from original plan:**
- Replaced shell scripts (`setup.sh`, `teardown.sh`) with Terraform (`infra/main.tf`). Single file, `var.scanning` bool switches between scanning and serving phases.
- SSH key is Terraform-managed (no passphrase, stored in state) rather than manually generated.
- CloudFront distribution deferred: not created in Phase 0, will add to Terraform when frontend is ready.
- Added `infra/README.md` with terse setup steps for future replication.

**Lessons learned:**
- Shell scripts with `2>/dev/null || echo "already exists"` swallow real errors. Terraform's declarative model avoids this entirely: errors are always surfaced.
- RDS requires a DB subnet group (2+ subnets in different AZs). The original shell script didn't create one, causing a silent failure. Terraform handles this dependency automatically.
- Amazon Linux 2023 uses `systemd-resolved` which manages `/etc/resolv.conf`.
Must disable it before pointing resolv.conf at Unbound. `chattr +i` doesn't work on the symlink.
- AWS EC2 key pairs created via API don't support passphrases. Use `tls_private_key` in Terraform or generate locally with `ssh-keygen` + import.
- When an AWS key pair name already exists from a previous run, Terraform may not regenerate it. Use `-replace` to force recreation of the key + instance together.

### Phase 1 (Steps 1.1-1.2) (Completed 2026-05-17)

**Changes from original plan:**
- Used DuckDB `aws` extension with `CREDENTIAL_CHAIN` instead of httpfs anonymous access. The commoncrawl S3 bucket requires authenticated requests.
- IAM role needed explicit `s3:GetObject` on `arn:aws:s3:::commoncrawl/*` and `s3:ListBucket` on `arn:aws:s3:::commoncrawl` (ListBucket is a bucket-level action). The bucket doesn't allow cross-account access based on bucket policy alone.
- Used `GROUP BY` with `first(... ORDER BY ...)` instead of `ROW_NUMBER()` window function. More memory-efficient (hash aggregation vs sort), cleaner syntax.
- DuckDB can glob `s3://.../subset=warc/*.parquet` directly (300 files), so there is no need to fetch a file list or download parquet locally.
- Dropped the `url_port IN (80, 443)` filter: CC stores standard ports as NULL, not 80/443. Replaced with `url_port IS NULL`.

**Lessons learned:**
- DuckDB URL-encodes `=` in S3 paths (e.g., `crawl%3DCC-MAIN-2026-17`) but S3 decodes it correctly. The real issue was always IAM permissions, not path encoding.
- The `commoncrawl` S3 bucket requires valid AWS credentials for both GetObject and ListBucket. Anonymous access (unsigned requests) does not work. Any valid IAM identity works as long as its policy allows it.
- DuckDB's LIMIT can interact unexpectedly with GROUP BY: the optimizer may stop reading input early once it has enough groups. This wasn't our issue (it was the port filter) but worth noting for future queries.
- CC-Index stores `url_port` as NULL for standard ports (80/443), not as the integer. Always check actual column values before writing filters.
- c5.xlarge (8GB) is tight for this query: it uses 6.4GB + swap. For the full 30M run, use c5.2xlarge (16GB).
- Query takes ~692s (11.5 min) for 100K output rows reading all 300 parquet files. Full run without LIMIT will be similar duration but more memory for the hash table.

### Phase 2 (Completed 2026-05-17)

**Changes from original plan:**
- Used AWS SDK S3 GetObject for WARC byte-range requests instead of HTTPS to `data.commoncrawl.org`. The HTTPS endpoint rate-limits at ~100 concurrent connections (429s). S3 has no such limit.
- Removed progress bar because it interfered with per-host log lines. Replaced with clean stdout log lines + summary at end. Check DB for mid-run progress.
- Added `process.go` and `log.go` files (plan had 4 files, we have 6, for cleaner separation).
- Added charset detection + UTF-8 conversion (`golang.org/x/net/html/charset` + `golang.org/x/text/transform`) for international titles.
- Added `strings.ToValidUTF8` sanitization as final safety net for titles that still have invalid bytes after charset conversion.
- Panic recovery per goroutine: logs `PANIC:` prefix, doesn't mark row as parsed (retryable on next run).
- DB write errors tracked separately (`DB_ERROR:` prefix, counted in summary + stats JSON).

**Lessons learned:**
- `data.commoncrawl.org` aggressively rate-limits (403/429) at ~100 concurrent connections. Use S3 API directly for high-concurrency access.
- Many Chinese/Japanese sites serve GBK or other non-UTF-8 encodings without declaring it in Content-Type or `<meta>`. `charset.DetermineEncoding` catches most but not all. `strings.ToValidUTF8` as final sanitization prevents Postgres encoding errors.
- gowarc's `HttpHeader()` can return nil for malformed records. Always nil-check library return values defensively.
- Increasing concurrency from 100 to 500 didn't improve throughput (~300 hosts/sec either way). The bottleneck is likely Postgres write latency or S3 per-connection bandwidth, not parallelism.
Could investigate batch inserts for the full run.
- Progress bars and per-item log lines don't mix well in terminals. Pick one or write progress to a separate channel (file, stderr).

### Phase 3 (Completed 2026-05-18)

**Changes from original plan:**
- Filtered eligible icons before downloading: skip link_rel icons with declared size >64x64 (apple-touch-icon bloat). Reduced download count from ~302K to ~224K.
- Channel-based worker pool instead of semaphore pattern: producer goroutine feeds work channel, N workers consume. No starvation between batch claims.
- Shared http.Transport for connection pooling (marginal benefit since hosts are unique, but reduces GC pressure).
- No progress bar, same approach as Phase 2 (log lines + summary).
- User-Agent set to `EveryTabBot/1.0` with link to `everytab.site/bot` for bot identification.

**Lessons learned:**
- 70% icon download success rate is expected: most failures are 404s from domains/pages that changed since the crawl. This is acceptable loss.
- 25% dedup rate: many hosted platforms (Wix, WordPress.com, Squarespace) serve identical default favicons. Content-addressed S3 storage handles this efficiently.
- `data.commoncrawl.org` rate-limits HTTPS but S3 does not, the same pattern as WARC parsing. Use S3 API for all CC access.
- Favicon download is I/O bound (network latency to diverse hosts worldwide). Concurrency helps up to a point, then the long tail of slow/dead servers dominates. 351 icons/sec at 200 concurrency.
- Invalid image detection (magic bytes) catches ~5% of "successful" downloads that are actually HTML error pages served at `/favicon.ico`.

---

## Future Improvements

- **WARC parser: retry on fetch errors.** Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
- **WARC parser: batch DB inserts.** Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
- **WARC parser: investigate throughput ceiling.** 300 hosts/sec at both 100 and 500 concurrency suggests a bottleneck. Profile to determine if it's S3 response latency, Postgres writes, or something else. For the full 30M run this determines wall-clock time (~28 hours at current rate).
- **CC-Index query: c5.2xlarge for full run.** 8GB is tight with 6.4GB usage + swap. Use a 16GB instance for the 30M-host full run.
- **Encoding: investigate remaining garbled titles.** Some titles still show `�` in output (e.g., `BERGSTRANDS BAGERI �...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
- **Icon download: retry transient failures.** DNS and timeout failures could benefit from a single retry. Would recover a small percentage of icons.
- **Icon download: download large link_rel icons.** Currently skipping declared sizes >64x64. Re-run with broader filter for future high-res projects.

# EveryTab Implementation Plan

This plan builds the system described in ARCHITECTURE.md in incremental steps. We start with 100K hosts to validate the pipeline end-to-end, then scale to the full ~30M.

Each step has a clear deliverable and validation criteria. Steps are sequential: each phase builds on the previous.

---

## Phase 0: Project Setup & AWS Infrastructure [COMPLETED]

### Step 0.1: Repository Structure [COMPLETED]

```
everytab/
├── design.md
├── ARCHITECTURE.md
├── PLAN.md
├── infra/
│   ├── main.tf                  # Terraform: all AWS resources
│   ├── terraform.tfvars.example
│   ├── ec2-userdata.sh          # EC2 bootstrap (Go, DuckDB, Unbound)
│   └── README.md                # Setup steps
├── pipeline/
│   ├── 01_cc_index/
│   │   └── schema.sql           # Postgres table definitions
│   ├── 02_warc_parse/
│   ├── 03_icon_download/
│   ├── 04_best_icon/
│   ├── 05_bundle_gen/
│   └── 06_frontend/
├── frontend/
├── stats/                       # gitignored
└── go.mod
```

### Step 0.2: AWS Infrastructure (Terraform) [COMPLETED]

Infrastructure managed via `infra/main.tf`.
Single file, uses `var.scanning` bool to switch phases:
- `terraform apply`: creates all scanning resources (EC2, RDS, S3 icons, S3 site, IAM, security groups)
- `terraform apply -var="scanning=false"`: destroys scanning resources, keeps site bucket
- `terraform destroy`: removes everything

Resources created:
- S3 `everytab-icons` (private), S3 `everytab-site` (for CloudFront later)
- RDS Postgres 16, db.t3.medium, 20GB gp3
- EC2 c5.xlarge, Amazon Linux 2023, 50GB gp3
- Security groups (SSH from home IP, RDS from EC2 only)
- IAM role + instance profile (S3 access only)
- SSH key (Terraform-managed ed25519)

### Step 0.3: EC2 Environment Setup [COMPLETED]

Bootstrap via `infra/ec2-userdata.sh`:
- Go 1.22+, DuckDB (httpfs + postgres extensions), Unbound (recursive resolver), psql, tmux
- Unbound configured as system resolver (systemd-resolved disabled)
- DATABASE_URL in .bashrc
- Schema applied: hosts + icons tables with indexes

---

## Phase 1: CC-Index Query (Stage 1)

### Step 1.1: Database Schema

Create the Postgres tables. Run via `psql`:

```sql
CREATE TABLE hosts (
    id                 SERIAL PRIMARY KEY,
    hostname           TEXT NOT NULL UNIQUE,
    protocol           TEXT NOT NULL,
    crawl_id           TEXT NOT NULL,
    warc_filename      TEXT NOT NULL,
    warc_record_offset BIGINT NOT NULL,
    warc_record_length INT NOT NULL,
    html_title         TEXT,
    iframe_allowed     BOOLEAN,
    best_icon_s3_key   TEXT,
    parsed             BOOLEAN DEFAULT FALSE
);

CREATE TABLE icons (
    id           SERIAL PRIMARY KEY,
    host_id      INT NOT NULL REFERENCES hosts(id),
    url          TEXT NOT NULL,
    source       TEXT NOT NULL,
    rel_type     TEXT,
    rel_sizes    TEXT,
    content_type TEXT,
    width        INT,
    height       INT,
    file_size    INT,
    s3_key       TEXT,
    scan_state   TEXT DEFAULT 'unscanned',
    error        TEXT
);

-- Partial indexes keep the "work remaining" lookups small
CREATE INDEX idx_hosts_parsed ON hosts(id) WHERE parsed = FALSE;
CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned';
CREATE INDEX idx_icons_host_id ON icons(host_id);
```

**Done when:** Tables exist in RDS, schema matches ARCHITECTURE.md.
### Step 1.2: DuckDB CC-Index Query (100K limit) [COMPLETED] Script: `pipeline/01_cc_index/query.sh` Uses DuckDB with `aws` extension (credential chain) to read parquet directly from `s3://commoncrawl/.../*.parquet` glob, with the `postgres` extension to write results into RDS. Auto-detects latest crawl ID from the CC API. Deduplication via `GROUP BY url_host_name` with `first(... ORDER BY ...)` aggregates (hash aggregation — more memory-efficient than window functions). **Result:** 100K hosts, 77% https / 23% http, completed in 692s. **Done when:** 100K hosts in the database with valid WARC coordinates. ### Step 1.3: Validate WARC Coordinates [COMPLETED] Manually fetched WARC records with curl byte-range requests to `data.commoncrawl.org`. Confirmed valid WARC headers, HTTP response, and HTML with `` and `<link rel="icon">` tags. --- ## Phase 2: WARC Parsing (Stage 2) [COMPLETED] ### Steps 2.1-2.3 [COMPLETED] Binary: `pipeline/02_warc_parse/` (5 files: main.go, warc.go, parser.go, process.go, db.go, log.go) **Architecture:** - Fetches WARC records via AWS SDK S3 byte-range GetObject (using EC2 instance profile credentials) - Parses WARC records with `github.com/nlnwa/gowarc/v3` - Parses HTML with `golang.org/x/net/html` tokenizer (lenient, stops at `<body>`) - Detects charset via `golang.org/x/net/html/charset` and converts to UTF-8 - Sanitizes titles with `strings.ToValidUTF8` as final safety net - Concurrent goroutine pool with configurable concurrency - Per-host log lines to stdout + optional log file - Panic recovery per goroutine (logs PANIC, doesn't mark row as parsed) - DB errors tracked and logged with `DB_ERROR:` prefix **CLI:** `./warc_parse --db URL [--concurrency N] [--batch-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]` **Result (100K hosts, concurrency 500):** - Duration: 5m31s (~300 hosts/sec) - Titles found: 93,384 (93%) - Icons found: 201,780 (~2 per host) - Iframe blocked: 17,855 (18%) - Fetch errors: 3 - DB errors: 0 - 
Panics: 0

---

## Phase 3: Icon Download (Stage 3) [COMPLETED]

### Steps 3.1-3.3 [COMPLETED]

Binary: `pipeline/03_icon_download/` (6 files: main.go, download.go, image.go, s3.go, db.go, log.go)

**Architecture:**
- Channel-based work distribution: producer goroutine claims batches, N worker goroutines consume from buffered channel (no worker starvation)
- Shared `http.Transport` for connection pooling / TLS session reuse
- Content-addressed S3 storage (SHA-256 hash as key, dedup via HeadObject before upload)
- Magic byte validation (PNG, GIF, JPEG, ICO, BMP, WebP, SVG)
- ICO directory parsing for dimensions (picks largest ≤64x64)
- Filters to eligible icons only: `favicon_ico` + link_rel with no declared size or ≤64x64
- md5(id) shuffle in claim query to spread requests across hosts
- Panic recovery per worker, DB errors tracked and logged

**CLI:** `./icon_download --db URL [--s3-bucket NAME] [--concurrency N] [--batch-size N] [--timeout D] [--max-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]`

**Result (100K hosts, ~224K eligible icons):**
- Duration: 10m36s (351 icons/sec)
- Completed: 156,214 (70%)
- Failed: 67,459 (30%; mostly HTTP 404s from stale crawl data)
- Dedup hits: 55,771 (25%; shared Wix/WordPress/hosted platform favicons)
- Downloaded: 1.9GB
- DNS errors: 1,668 | Timeouts: 2,129 | HTTP errors: 47,565 | Invalid: 11,803 | Too large: 777
- DB errors: 0 | Panics: 0

---

## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5)

### Step 4.1: Best Icon Selection SQL

Write `pipeline/04_best_icon/select.sql`:

```sql
UPDATE hosts h SET best_icon_s3_key = sub.s3_key
FROM (
    -- One row per host: the first icon under the ORDER BY below
    SELECT DISTINCT ON (i.host_id) i.host_id, i.s3_key
    FROM icons i
    WHERE i.scan_state = 'completed'
    ORDER BY i.host_id,
        -- Size tier: standard square sizes, then other small squares, then anything ≤64x64
        CASE
            WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
            WHEN i.width = i.height AND i.width <= 64 THEN 1
            WHEN i.width IS NOT NULL AND i.width <= 64 AND i.height <= 64 THEN 2
            ELSE 3
        END,
        -- Within a tier, prefer the larger icon
        COALESCE(i.width, 0) DESC,
        -- Format preference
        CASE
            WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
            WHEN i.content_type = 'image/webp' THEN 1
            WHEN i.content_type = 'image/svg+xml' THEN 2
            ELSE 3
        END,
        -- Final tie-break: smallest file
        i.file_size ASC
) sub
WHERE h.id = sub.host_id;
```

**Validation:**
- `SELECT COUNT(*) FROM hosts WHERE best_icon_s3_key IS NOT NULL;` (expect 60-80% of hosts)
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL AND best_icon_s3_key IS NULL;` (hosts with title but no icon; these will still be in bundles)
- Spot-check: for a few hosts, verify the selected icon is reasonable (correct size, valid)

**Stats:** `stats/04_best_icon.json`

**Done when:** best_icon_s3_key populated for hosts that have valid icons.

### Step 4.2: Bundle Generator Go Program

```
pipeline/05_bundle_gen/
├── main.go          # Entry point, CLI flags
├── db.go            # Query hosts + icon keys
├── convert.go       # Icon format conversion → PNG
├── bundle.go        # Chunk + serialize JSON
└── s3.go            # Upload bundles to everytab-site
```

CLI flags:
- `--db` connection string
- `--icons-bucket` (default `everytab-icons`)
- `--site-bucket` (default `everytab-site`)
- `--entries-per-bundle` (tunable, start at 120)
- `--dry-run` (generate bundles to local disk, don't upload)
- `--limit` (only process N hosts, for testing)

### Step 4.3: Icon Conversion Logic

Implement format conversion to PNG:
1. Download icon from S3 by key
2. Detect format from magic bytes
3. Decode:
   - PNG: decode directly
   - ICO: parse container, extract image at recorded width/height, decode BMP or PNG within
   - GIF/JPEG/BMP/WebP: decode to RGBA
   - SVG: rasterize to 32x32 (use a Go SVG library, or shell out to `rsvg-convert` if simpler)
4. Re-encode as PNG (optimized, don't upscale)
5. Base64-encode

**Test:** Convert 50 icons of mixed formats manually, verify output PNGs look correct.

### Step 4.4: Bundle Assembly + Upload

Implement:
1. Query all hosts WHERE html_title IS NOT NULL, randomize (ORDER BY random())
2. For each host: fetch + convert its icon (or set empty string if no icon)
3. Assemble entries into chunks of `ENTRIES_PER_BUNDLE`
4. Serialize each chunk as JSON (`tabs/{n}.json`)
5. Upload to S3 `everytab-site/tabs/`
6. Record total bundle count

**Dry-run:** Generate bundles to local disk, inspect a few:
- Valid JSON
- Icons render in browser (paste a data:image/png;base64,... URI)
- Entries have host, title, icon, icon_w, icon_h, iframe_ok

**Validation:**
- Bundle files exist in S3
- `aws s3 ls s3://everytab-site/tabs/ | wc -l` matches expected count
- Random bundle can be fetched and parsed as JSON
- Total hosts across all bundles = count of hosts with titles

**Stats:** `stats/05_bundle_gen.json`

**Done when:** All bundles uploaded to S3, JSON is valid, icons render.

---

## Phase 5: Frontend (Stage 6)

Begins after Phase 4 is complete: we use real bundle data from the 100K pipeline run for frontend development.

### Step 5.1: Local Dev Server

Serve the generated bundles from S3 locally for frontend development:

```bash
# Copy the first 10 bundles locally for testing
# (`aws s3 sync` has no option to cap the number of files)
aws s3 ls s3://everytab-site/tabs/ | head -n 10 | awk '{print $4}' |
  while read -r f; do aws s3 cp "s3://everytab-site/tabs/$f" ./local-tabs/; done

# Serve with any static file server
python3 -m http.server 8000
```

**Done when:** Can fetch real bundle JSON from a local dev server.

### Step 5.2: Basic Tab Rendering

Build `frontend/index.html` and `frontend/site.js`:
1. HTML: minimal shell with a container div, inline CSS for tab styling
2. JS: fetch a bundle, render tabs as rows filling the viewport
3. Tab appearance: mimic Firefox tab shape (rounded top corners, slight border)
4. Each tab shows favicon (16x16 or 32x32 img from data URI) + truncated title
5. No-icon tabs show title only

Focus: get the visual density right. How many tabs fit across? How many rows fill the viewport? This determines `ENTRIES_PER_BUNDLE`.

**Done when:** Page renders tabs from a locally served bundle. Visually looks like a page full of browser tabs.
### Step 5.3: Marquee Animation

Add horizontal marquee to each row:
- CSS `@keyframes` animation, translateX
- Each row at slightly different speed and direction (some left, some right)
- Smooth, subtle movement: not distracting, just enough to feel alive
- Rows need extra tabs beyond viewport width to avoid gaps during scroll

**Done when:** Rows scroll smoothly, no visual glitches at edges.

### Step 5.4: Interaction: Click, Iframe, Close

Implement tab click behavior:
1. If `iframe_ok`: show an overlay with iframe loading the site (`{protocol}://{hostname}`)
2. If `!iframe_ok`: open in new tab (`target="_blank"`, add `rel="noopener"`)
3. Visual indicator on tabs that will open externally (small icon/badge)
4. Close overlay: X button + click-outside + Escape key

**Done when:** Clicking tabs works correctly for both iframe and external cases.

### Step 5.5: Infinite Scroll + Random Bundle Loading

Implement:
1. PRNG seeded with `Date.now()`: for a given seed it generates a deterministic sequence of bundle indices
2. On page load: fetch first bundle, render
3. Scroll detection: when user approaches bottom, fetch next random bundle
4. Track loaded bundle IDs in a Set (no duplicates)
5. Append new rows below existing ones
6. Handle edge case: all bundles loaded (unlikely with 50K+ bundles but handle gracefully)

`TOTAL_BUNDLES` is a constant baked into the JS at build time.

**Done when:** Infinite scroll works, new bundles load seamlessly, no duplicate bundles.

### Step 5.6: Frontend Build Script

Write `pipeline/06_frontend/build.sh`:
1. Read total bundle count (from pipeline output or S3)
2. Inject `const TOTAL_BUNDLES = {M};` into site.js
3. Copy index.html + site.js to S3 `everytab-site/`
4. Invalidate CloudFront (if distribution exists)

**Done when:** Build script produces deployable frontend with correct bundle count.

---

## Phase 6: Integration & End-to-End Test (100K)

### Step 6.1: Run Full Pipeline (100K)

Execute all stages in sequence on EC2:
1. Verify hosts table has 100K entries (from Phase 1)
2. Run WARC parser (Phase 2); should complete in minutes
3. Run icon downloader (Phase 3); should complete in 10-30 minutes at 100K scale
4. Run best icon selection (Phase 4.1)
5. Run bundle generator (Phase 4.2-4.4)
6. Run frontend build (Phase 5.6)

**Validation:** Visit the CloudFront URL. The site should work:
- Tabs render with real favicons and titles
- Clicking works (iframe + external)
- Scrolling loads more tabs
- No JS console errors

### Step 6.2: Tune Parameters

Based on the 100K run:
- **ENTRIES_PER_BUNDLE:** Look at the live site. Does one bundle fill the screen? Too many tabs? Too few? Adjust.
- **Concurrency:** Was the icon download memory-stable? CPU-bound or network-bound? Adjust goroutine pool size.
- **Timeouts:** What was the error distribution? Are timeouts too aggressive? Too lenient?
- **Icon selection:** Do the selected icons look good? Any weird sizes or broken images?

Update CLI flag defaults based on findings.

### Step 6.3: Collect & Review Stats

Merge all `stats/*.json` into a single pipeline report. Review:
- Loss at each stage (domains → parsed → icons downloaded → icons selected → bundled)
- Time per stage
- Error patterns (are certain TLDs failing more? certain icon formats?)
- Storage usage (S3 icons bucket, S3 site bucket)

Identify any pipeline bugs or data quality issues. Fix before scaling up.

**Done when:** End-to-end works at 100K, parameters tuned, stats reviewed, bugs fixed.

---

## Phase 7: Full-Scale Run (30M)

### Step 7.1: Remove Limits, Re-run CC-Index Query

Update the DuckDB query to remove `LIMIT 100000`. Re-run.

Considerations:
- If httpfs takes >1hr, switch to downloading the parquet files first
- May need to increase RDS storage (30M rows with WARC paths ≈ 5-10GB)
- Monitor DuckDB memory usage

**Validation:** `SELECT COUNT(*) FROM hosts;` shows ~30M rows.

### Step 7.2: Run WARC Parser at Scale

Run with full concurrency against 30M hosts. Expected time: 2-6 hours (original estimate; at the ~300 hosts/sec measured in Phase 2, 30M hosts would take ~28 hours unless the throughput ceiling is raised).
Monitor:
- Throughput (hosts/sec)
- Error rate stability (should plateau, not climb)
- Postgres connection pool health
- Memory usage

### Step 7.3: Run Icon Downloader at Scale

This is the long pole: expected 12-48 hours. Monitor continuously:
- icons/sec rate
- DNS cache hit rate (check Unbound stats: `unbound-control stats`)
- S3 upload rate
- Error rate by type
- Completion percentage

If too slow (projected >48hrs):
- Consider increasing concurrency (if memory allows)
- Consider spinning up fleet (add more EC2 instances running the same binary)
- Check if DNS is the bottleneck (Unbound stats)
- Check if S3 uploads are the bottleneck (batch or reduce HEAD checks)

### Step 7.4: Best Icon Selection + Bundle Generation

Run at full scale. Expected: 1-2 hours total.

Monitor bundle sizes: verify they're in the expected range with `ENTRIES_PER_BUNDLE` from tuning.

### Step 7.5: Rebuild Frontend + Deploy

Run frontend build with the real bundle count. Invalidate CloudFront.

**Validation:** Visit the live site. Browse around. Check:
- Tab variety (seeing diverse sites, not just one TLD)
- Icon quality (no broken images, reasonable sizes)
- Performance (bundles load quickly, no jank)
- Stats page / stats.json looks correct

**Done when:** Full-scale site is live and working.

---

## Phase 8: Backup & Teardown

### Step 8.1: Backup RDS to Homelab

```bash
# On EC2 (fast connection to RDS):
pg_dump -Fc "$DATABASE_URL" > everytab_dump.pgfc

# Transfer to homelab (from EC2 or direct):
scp everytab_dump.pgfc homelab:/backups/everytab/

# On homelab, verify restore (pg_restore needs the target database to exist):
createdb everytab_local
pg_restore -d everytab_local everytab_dump.pgfc
# One -c per query so both counts are printed on every psql version
psql everytab_local -c "SELECT COUNT(*) FROM hosts;" -c "SELECT COUNT(*) FROM icons;"
```

### Step 8.2: Backup Icons S3 to Homelab

```bash
# From homelab (or EC2 as intermediary):
aws s3 sync s3://everytab-icons/ /backups/everytab/icons/

# Verify file count matches:
ls /backups/everytab/icons/ | wc -l
# Compare with:
aws s3 ls s3://everytab-icons/ | wc -l
```

### Step 8.3: Verify & Teardown

After confirming backups:

```bash
# Verify the live site still works (it only depends on everytab-site + CloudFront)
curl -s https://your-cloudfront-domain.net/ | head

# Teardown scanning infrastructure:
aws rds delete-db-instance --db-instance-identifier everytab --skip-final-snapshot
aws s3 rb s3://everytab-icons --force
aws ec2 terminate-instances --instance-ids i-xxxxx
```

**Done when:** Only `everytab-site` S3 bucket + CloudFront remain running. Monthly cost: ~$2-4.

---

## Development Notes

### Execution Order

Phases are sequential: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8.

Frontend (Phase 5) uses real data from the 100K pipeline run. The only thing that can be developed ahead of time is writing Go code locally before EC2 is ready (compile-test locally, run on EC2).

### Progress & Observability

All Go programs have two output modes running simultaneously:

**Per-item log lines** (stdout, above the progress bar):
- WARC parser: `parsed: example.com 200 "Example Domai..." ok` or `parsed: broken.net 200 "" err:no_title`
- Icon downloader: `icon: https://example.com/favicon.ico 32x32 png 4.2KB ok` or `icon: https://fail.org/favicon.ico err:timeout`
- Bundle generator: `bundle: 0042.json 120 entries 247KB ok`

Each line is a short, fixed-format summary: hostname/URL, key result, and status. Keeps it scannable when running live.

**Log file** (`--log-file path/to/out.log`): If provided, mirror all per-item log lines to disk. For full-scale runs, consider using `--log-errors-only` flag to only write error lines to the log file (avoids filling disk with 30M success lines). Without `--log-file`, logs only go to stdout.

**Progress bar** (bottom of terminal, `schollz/progressbar`):
- Items processed / total items
- Processing rate (items/sec)
- ETA
- Error count

On completion, each program prints a summary line and writes its stats JSON (with started_at, finished_at, duration_seconds, and stage-specific counters).

### Testing Strategy

- **Dry-run flags** on all Go programs: print what would happen without mutating DB/S3
- **--limit flags** on all Go programs: process a small subset quickly
- **Spot-checks:** after each stage, manually verify 5-10 random entries
- **Stats files:** compare counts between stages to catch data loss
- **100K dev set:** full pipeline at small scale before committing to a 24hr+ full run

### Common Pitfalls to Watch For

- **DuckDB CC-Index path:** The exact S3 path to parquet files changes per crawl. Check Common Crawl's website for the latest crawl ID and index location.
- **WARC record format:** WARC records have a specific envelope format (WARC/1.0 header, blank line, HTTP response). Don't assume the HTTP response starts at byte 0.
- **Relative icon URLs:** `/favicon.ico` is relative to root, but `favicon.ico` (no leading slash) is relative to the page path. Since we only have root pages (`/`), both resolve the same. `../icons/fav.png` looks tricky, but under RFC 3986 resolution against a root page the excess `..` is dropped and it becomes `/icons/fav.png`; Go's `net/url` `ResolveReference` applies that rule. Resolve every href against the page URL rather than concatenating strings, and skip hrefs that fail to parse.
- **ICO files are complex:** The ICO container format can embed BMP (with a modified header) or PNG. Many "ICO" files are actually just PNGs renamed to .ico. Check magic bytes, not file extension.
- **SVG rasterization:** Go doesn't have great native SVG support. Consider shelling out to `rsvg-convert` (the CLI that ships with `librsvg`), or use a Go library like `github.com/nicholasgasior/goresvg`. This can be a follow-up if SVG icons are rare.
- **Postgres connection limits:** RDS db.t3.medium has max_connections ≈ 80. With 1000 goroutines, we need connection pooling (pgx pool handles this). Set pool max to ~40 connections.
- **S3 consistency:** S3 has provided strong read-after-write consistency since December 2020, so a HEAD request after a successful upload will find the object. The dedup check can still race when two workers download identical content at the same time: both see "not found" and both upload. That is harmless (idempotent, since the key is the content hash), so handle "not found" by just uploading again.
- **CloudFront caching:** After deploying new bundles, invalidate `/*` or set short TTL during development. For production, use long TTLs (bundles are immutable between crawls).

---

## Progress Log

### Phase 0: Completed 2026-05-17

**Changes from original plan:**
- Replaced shell scripts (`setup.sh`, `teardown.sh`) with Terraform (`infra/main.tf`). Single file, `var.scanning` bool switches between scanning and serving phases.
- SSH key is Terraform-managed (no passphrase, stored in state) rather than manually generated.
- CloudFront distribution deferred: not created in Phase 0, will add to Terraform when frontend is ready.
- Added `infra/README.md` with terse setup steps for future replication.

**Lessons learned:**
- Shell scripts with `2>/dev/null || echo "already exists"` swallow real errors. Terraform's declarative model avoids this entirely: errors are always surfaced.
- RDS requires a DB subnet group (2+ subnets in different AZs). The original shell script didn't create one, causing a silent failure. Terraform handles this dependency automatically.
- Amazon Linux 2023 uses `systemd-resolved` which manages `/etc/resolv.conf`.
  Must disable it before pointing resolv.conf at Unbound. `chattr +i` doesn't work on the symlink.
- AWS EC2 key pairs created via API don't support passphrases. Use `tls_private_key` in Terraform or generate locally with `ssh-keygen` + import.
- When an AWS key pair name already exists from a previous run, Terraform may not regenerate it. Use `-replace` to force recreation of the key + instance together.

### Phase 1 (Steps 1.1-1.2): Completed 2026-05-17

**Changes from original plan:**
- Used DuckDB `aws` extension with `CREDENTIAL_CHAIN` instead of httpfs anonymous access. The commoncrawl S3 bucket requires authenticated requests.
- IAM role needed explicit `s3:GetObject` on `arn:aws:s3:::commoncrawl/*` and `s3:ListBucket` on `arn:aws:s3:::commoncrawl` (ListBucket is a bucket-level action). The bucket doesn't allow cross-account access based on bucket policy alone.
- Used `GROUP BY` with `first(... ORDER BY ...)` instead of `ROW_NUMBER()` window function. More memory-efficient (hash aggregation vs sort), cleaner syntax.
- DuckDB can glob `s3://.../subset=warc/*.parquet` directly (300 files), so there is no need to fetch a file list or download parquet locally.
- Dropped the `url_port IN (80, 443)` filter: CC stores standard ports as NULL, not 80/443. Replaced with `url_port IS NULL`.

**Lessons learned:**
- DuckDB URL-encodes `=` in S3 paths (e.g., `crawl%3DCC-MAIN-2026-17`) but S3 decodes it correctly. The real issue was always IAM permissions, not path encoding.
- The `commoncrawl` S3 bucket requires valid AWS credentials for both GetObject and ListBucket. Anonymous access (unsigned requests) does not work. Any valid IAM identity works as long as its policy allows it.
- DuckDB's LIMIT can interact unexpectedly with GROUP BY: the optimizer may stop reading input early once it has enough groups. This wasn't our issue (it was the port filter) but worth noting for future queries.
- CC-Index stores `url_port` as NULL for standard ports (80/443), not as the integer. Always check actual column values before writing filters.
- c5.xlarge (8GB) is tight for this query: it uses 6.4GB + swap. For the full 30M run, use c5.2xlarge (16GB).
- Query takes ~692s (11.5 min) for 100K output rows reading all 300 parquet files. Full run without LIMIT will be similar duration but more memory for the hash table.

### Phase 2: Completed 2026-05-17

**Changes from original plan:**
- Used AWS SDK S3 GetObject for WARC byte-range requests instead of HTTPS to `data.commoncrawl.org`. The HTTPS endpoint rate-limits at ~100 concurrent connections (429s). S3 has no such limit.
- Removed progress bar because it interfered with per-host log lines. Replaced with clean stdout log lines + summary at end. Check DB for mid-run progress.
- Added `process.go` and `log.go` files (plan had 4 files, we have 6; cleaner separation).
- Added charset detection + UTF-8 conversion (`golang.org/x/net/html/charset` + `golang.org/x/text/transform`) for international titles.
- Added `strings.ToValidUTF8` sanitization as final safety net for titles that still have invalid bytes after charset conversion.
- Panic recovery per goroutine: logs `PANIC:` prefix, doesn't mark row as parsed (retryable on next run).
- DB write errors tracked separately (`DB_ERROR:` prefix, counted in summary + stats JSON).

**Lessons learned:**
- `data.commoncrawl.org` aggressively rate-limits (403/429) at ~100 concurrent connections. Use S3 API directly for high-concurrency access.
- Many Chinese/Japanese sites serve GBK or other non-UTF-8 encodings without declaring it in Content-Type or `<meta>`. `charset.DetermineEncoding` catches most but not all. `strings.ToValidUTF8` as final sanitization prevents Postgres encoding errors.
- gowarc's `HttpHeader()` can return nil for malformed records, so always nil-check library return values defensively.
- Increasing concurrency from 100 to 500 didn't improve throughput (~300 hosts/sec either way). The bottleneck is likely Postgres write latency or S3 per-connection bandwidth, not parallelism.
  Could investigate batch inserts for the full run.
- Progress bars and per-item log lines don't mix well in terminals. Pick one or write progress to a separate channel (file, stderr).

### Phase 3: Completed 2026-05-18

**Changes from original plan:**
- Filtered eligible icons before downloading: skip link_rel icons with declared size >64x64 (apple-touch-icon bloat). Reduced download count from ~302K to ~224K.
- Channel-based worker pool instead of semaphore pattern: producer goroutine feeds work channel, N workers consume. No starvation between batch claims.
- Shared http.Transport for connection pooling (marginal benefit since hosts are unique, but reduces GC pressure).
- No progress bar; same approach as Phase 2 (log lines + summary).
- User-Agent set to `EveryTabBot/1.0` with link to `everytab.site/bot` for bot identification.

**Lessons learned:**
- 70% icon download success rate is expected: most failures are 404s from domains/pages that changed since the crawl. This is acceptable loss.
- 25% dedup rate: many hosted platforms (Wix, WordPress.com, Squarespace) serve identical default favicons. Content-addressed S3 storage handles this efficiently.
- `data.commoncrawl.org` rate-limits HTTPS but S3 does not, the same pattern as WARC parsing. Use S3 API for all CC access.
- Favicon download is I/O bound (network latency to diverse hosts worldwide). Concurrency helps up to a point, then the long tail of slow/dead servers dominates. 351 icons/sec at 200 concurrency.
- Invalid image detection (magic bytes) catches ~5% of "successful" downloads that are actually HTML error pages served at `/favicon.ico`.

---

## Future Improvements

- **WARC parser: retry on fetch errors.** Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
- **WARC parser: batch DB inserts.** Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
- **WARC parser: investigate throughput ceiling.** 300 hosts/sec at both 100 and 500 concurrency suggests a bottleneck. Profile to determine if it's S3 response latency, Postgres writes, or something else. For the full 30M run this determines wall-clock time (~28 hours at current rate).
- **CC-Index query: c5.2xlarge for full run.** 8GB is tight with 6.4GB usage + swap. 16GB instance for the 30M-host full run.
- **Encoding: investigate remaining garbled titles.** Some titles still show `�` in output (e.g., `BERGSTRANDS BAGERI �...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
- **Icon download: retry transient failures.** DNS and timeout failures could benefit from a single retry. Would recover a small percentage of icons.
- **Icon download: download large link_rel icons.** Currently skipping declared sizes >64x64. Re-run with broader filter for future high-res projects.