everytab/PLAN.md

47 KiB
Raw Permalink Blame History

EveryTab Implementation Plan

This plan builds the system described in ARCHITECTURE.md in incremental steps. We start with 100K hosts to validate the pipeline end-to-end, then scale to the full ~30M.

Each step has a clear deliverable and validation criteria. Steps are sequential — each phase builds on the previous.


Phase 0: Project Setup & AWS Infrastructure [COMPLETED]

Step 0.1: Repository Structure [COMPLETED]

everytab/
├── design.md
├── ARCHITECTURE.md
├── PLAN.md
├── infra/
│   ├── main.tf              # Terraform: all AWS resources
│   ├── terraform.tfvars.example
│   ├── ec2-userdata.sh      # EC2 bootstrap (Go, DuckDB, Unbound)
│   └── README.md            # Setup steps
├── pipeline/
│   ├── 01_cc_index/
│   │   └── schema.sql      # Postgres table definitions
│   ├── 02_warc_parse/
│   ├── 03_icon_download/
│   ├── 04_best_icon/
│   ├── 05_bundle_gen/
│   └── 06_frontend/
├── frontend/
├── stats/                   # gitignored
└── go.mod

Step 0.2: AWS Infrastructure (Terraform) [COMPLETED]

Infrastructure managed via infra/main.tf. Single file, uses var.scanning bool to switch phases:

  • terraform apply — creates all scanning resources (EC2, RDS, S3 icons, S3 site, IAM, security groups)
  • terraform apply -var="scanning=false" — destroys scanning resources, keeps site bucket
  • terraform destroy — removes everything

Resources created:

  • S3 everytab-icons (private), S3 everytab-site (for CloudFront later)
  • RDS Postgres 16, db.t3.medium, 20GB gp3
  • EC2 c5.xlarge, Amazon Linux 2023, 50GB gp3
  • Security groups (SSH from home IP, RDS from EC2 only)
  • IAM role + instance profile (S3 access only)
  • SSH key (Terraform-managed ed25519)

Step 0.3: EC2 Environment Setup [COMPLETED]

Bootstrap via infra/ec2-userdata.sh:

  • Go 1.22+, DuckDB (httpfs + postgres extensions), Unbound (recursive resolver), psql, tmux
  • Unbound configured as system resolver (systemd-resolved disabled)
  • DATABASE_URL in .bashrc
  • Schema applied: hosts + icons tables with indexes

Phase 1: CC-Index Query (Stage 1)

Step 1.1: Database Schema

Create the Postgres tables. Run via psql:

CREATE TABLE hosts (
    id SERIAL PRIMARY KEY,
    hostname TEXT NOT NULL UNIQUE,
    protocol TEXT NOT NULL,
    crawl_id TEXT NOT NULL,
    warc_filename TEXT NOT NULL,
    warc_record_offset BIGINT NOT NULL,
    warc_record_length INT NOT NULL,
    html_title TEXT,
    iframe_allowed BOOLEAN,
    best_icon_s3_key TEXT,
    parsed BOOLEAN DEFAULT FALSE
);

CREATE TABLE icons (
    id SERIAL PRIMARY KEY,
    host_id INT NOT NULL REFERENCES hosts(id),
    url TEXT NOT NULL,
    source TEXT NOT NULL,
    rel_type TEXT,
    rel_sizes TEXT,
    content_type TEXT,
    width INT,
    height INT,
    file_size INT,
    s3_key TEXT,
    scan_state TEXT DEFAULT 'unscanned',
    error TEXT
);

CREATE INDEX idx_hosts_parsed ON hosts(id) WHERE parsed = FALSE;
CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned';
CREATE INDEX idx_icons_host_id ON icons(host_id);

Done when: Tables exist in RDS, schema matches ARCHITECTURE.md.

Step 1.2: DuckDB CC-Index Query (100K limit) [COMPLETED]

Script: pipeline/01_cc_index/query.sh

Uses DuckDB with aws extension (credential chain) to read parquet directly from s3://commoncrawl/.../*.parquet glob, with the postgres extension to write results into RDS. Auto-detects latest crawl ID from the CC API.

Deduplication via GROUP BY url_host_name with first(... ORDER BY ...) aggregates (hash aggregation — more memory-efficient than window functions).

Result: 100K hosts, 77% https / 23% http, completed in 692s.

Done when: 100K hosts in the database with valid WARC coordinates.

Step 1.3: Validate WARC Coordinates [COMPLETED]

Manually fetched WARC records with curl byte-range requests to data.commoncrawl.org. Confirmed valid WARC headers, HTTP response, and HTML with <title> and <link rel="icon"> tags.


Phase 2: WARC Parsing (Stage 2) [COMPLETED]

Steps 2.1-2.3 [COMPLETED]

Binary: pipeline/02_warc_parse/ (5 files: main.go, warc.go, parser.go, process.go, db.go, log.go)

Architecture:

  • Fetches WARC records via AWS SDK S3 byte-range GetObject (using EC2 instance profile credentials)
  • Parses WARC records with github.com/nlnwa/gowarc/v3
  • Parses HTML with golang.org/x/net/html tokenizer (lenient, stops at <body>)
  • Detects charset via golang.org/x/net/html/charset and converts to UTF-8
  • Sanitizes titles with strings.ToValidUTF8 as final safety net
  • Concurrent goroutine pool with configurable concurrency
  • Per-host log lines to stdout + optional log file
  • Panic recovery per goroutine (logs PANIC, doesn't mark row as parsed)
  • DB errors tracked and logged with DB_ERROR: prefix

CLI: ./warc_parse --db URL [--concurrency N] [--batch-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]

Result (100K hosts, concurrency 500):

  • Duration: 5m31s (~300 hosts/sec)
  • Titles found: 93,384 (93%)
  • Icons found: 201,780 (~2 per host)
  • Iframe blocked: 17,855 (18%)
  • Fetch errors: 3
  • DB errors: 0
  • Panics: 0

Phase 3: Icon Download (Stage 3) [COMPLETED]

Steps 3.1-3.3 [COMPLETED]

Binary: pipeline/03_icon_download/ (6 files: main.go, download.go, image.go, s3.go, db.go, log.go)

Architecture:

  • Channel-based work distribution: producer goroutine claims batches, N worker goroutines consume from buffered channel (no worker starvation)
  • Shared http.Transport for connection pooling / TLS session reuse
  • Content-addressed S3 storage (SHA-256 hash as key, dedup via HeadObject before upload)
  • Magic byte validation (PNG, GIF, JPEG, ICO, BMP, WebP, SVG)
  • ICO directory parsing for dimensions (picks largest ≤64x64)
  • Filters to eligible icons only: favicon_ico + link_rel with no declared size or ≤64x64
  • md5(id) shuffle in claim query to spread requests across hosts
  • Panic recovery per worker, DB errors tracked and logged

CLI: ./icon_download --db URL [--s3-bucket NAME] [--concurrency N] [--batch-size N] [--timeout D] [--max-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]

Result (100K hosts, ~224K eligible icons):

  • Duration: 10m36s (351 icons/sec)
  • Completed: 156,214 (70%)
  • Failed: 67,459 (30% — mostly HTTP 404s from stale crawl data)
  • Dedup hits: 55,771 (25% — shared Wix/WordPress/hosted platform favicons)
  • Downloaded: 1.9GB
  • DNS errors: 1,668 | Timeouts: 2,129 | HTTP errors: 47,565 | Invalid: 11,803 | Too large: 777
  • DB errors: 0 | Panics: 0

Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5) [COMPLETED]

Step 4.1: Best Icon Selection SQL [COMPLETED]

Script: pipeline/04_best_icon/select.sql

Selects the best icon per host using DISTINCT ON with priority ordering. Excludes SVGs (can't rasterize) and ≤2x2 icons (tracking pixels). See ARCHITECTURE.md for the full decision flow.

Result: 70,366 hosts got an icon (72%), 23,018 have title but no icon.

Steps 4.2-4.4: Bundle Generator [COMPLETED]

Binary: pipeline/05_bundle_gen/ (6 files: main.go, bundle.go, convert.go, db.go, s3.go, log.go)

Architecture:

  • Queries all hosts with titles (randomized), concurrently downloads best icon from S3 icons bucket
  • Uses github.com/biessek/golang-ico for ICO decoding (handles all bit depths including palette-based 1/4/8bpp)
  • image.Decode handles PNG/GIF/JPEG/WebP/BMP/ICO via registered decoders. SVGs excluded.
  • Icons >128px downscaled to 32x32 (nearest-neighbor). Icons ≤128px kept as-is.
  • Re-encodes all icons as PNG, base64-encoded inline in bundle JSON.
  • Panic recovery per icon conversion (malformed ICO files in the library)
  • Concurrent S3 downloads with configurable concurrency (default 50)

CLI: ./bundle_gen --db URL [--icons-bucket NAME] [--site-bucket NAME] [--entries-per-bundle N] [--concurrency N] [--limit N] [--dry-run] [--output-dir DIR] [--log-file PATH] [--log-errors-only]

Result (93K hosts with titles, 70K with icons):

  • Duration: 1m30s
  • Bundles created: 779 (120 entries each, last bundle partial)
  • Total size: 165MB (avg 216KB per bundle)
  • Convert errors: 1,263 (1,077 SVGs + 186 other — panics, truncated files, corrupt GIFs)
  • S3: 779 JSON files in everytab-site/tabs/

Phase 5: Frontend (Stage 6) [COMPLETED — v1]

Steps 5.1-5.6 [COMPLETED]

Files: frontend/index.html and frontend/site.js

Architecture:

  • Vanilla JS, no framework. Two files: HTML (with inline CSS) + JS.
  • Fetches random bundle JSONs from tabs/{N}.json, renders tabs as rows filling the viewport.
  • Seeded PRNG (Date.now() + mulberry32) — every visitor sees unique tab arrangement.
  • Infinite scroll: loads more bundles as user approaches the bottom.
  • Tracks loaded bundle IDs in a Set to avoid duplicates.

Tab rendering:

  • Browser-specific tab styling via navigator.userAgent detection (Chrome, Firefox, Safari).
  • Inactive tab appearance by default, selected/active style when iframe is open.
  • Light mode default, auto-switches to dark mode via prefers-color-scheme.
  • Bidirectional marquee: each row randomly scrolls left or right at different speeds (90-150s per cycle).
  • Tabs duplicated in DOM for seamless marquee loop (translateX(-50%)).
  • Hover shows full title as native tooltip.
  • External link indicator (↗) on tabs that don't allow iframes.

Iframe viewer:

  • Inline, not overlay — opens between tab rows, pushes content down (75vh height).
  • Header shows favicon, title, external link, and close button.
  • Sandboxed iframe (allow-scripts allow-same-origin allow-forms).
  • Close via X button, Escape key.
  • Only one viewer open at a time.

TOTAL_BUNDLES baked into HTML at build time. Build script (pipeline/06_frontend/build.sh) still TODO — currently hardcoded.


Phase 6: Integration & End-to-End Test (100K) [COMPLETED]

Steps 6.1-6.3 [COMPLETED]

Full clean end-to-end run from terraform apply to live site at everytab.site.

Pipeline timing (100K hosts):

Stage Duration
CC-Index query 13m11s
WARC parsing 4m55s
Icon download 10m39s
Best icon selection instant
Bundle generation 1m32s
Frontend deploy seconds
Total pipeline ~31 minutes

Loss funnel:

100,000 hosts
 → 93,432 with titles (6.6% loss)
 → 70,551 with icons selected (24.4% loss)
 → 69,306 with icons in bundles (1.8% convert errors)
 → 779 bundles, 165MB total, avg 217KB per bundle

Parameter review:

  • ENTRIES_PER_BUNDLE = 120 — fills the screen well, kept as-is
  • Icon download concurrency 200 — I/O bound at 350 icons/sec, increasing doesn't help
  • Timeouts 10s — good balance, 2,261 timeouts (1%) is acceptable
  • Icons look good on the live site

Infrastructure notes:

  • c5.xlarge (8GB) needs 4GB swap for the DuckDB query — gets OOM killed without it
  • DuckDB occasionally gets S3 503 (rate limit) — retry works
  • force_destroy = true on icons bucket needed for clean teardown
  • deploy.sh sed pattern must match [0-9]* not .* to avoid eating </script>

Development Notes

Execution Order

Phases are sequential: 0 → 1 → 2 → 3 → 4 → 5 → 6 Frontend (Phase 5) uses real data from the 100K pipeline run. The only thing that can be developed ahead of time is writing Go code locally before EC2 is ready (compile-test locally, run on EC2).

Progress & Observability

All Go programs have two output modes running simultaneously:

Per-item log lines (stdout, above the progress bar):

  • WARC parser: parsed: example.com 200 "Example Domai..." ok or parsed: broken.net 200 "" err:no_title
  • Icon downloader: icon: https://example.com/favicon.ico 32x32 png 4.2KB ok or icon: https://fail.org/favicon.ico err:timeout
  • Bundle generator: bundle: 0042.json 120 entries 247KB ok

Each line is a short, fixed-format summary — hostname/URL, key result, and status. Keeps it scannable when running live.

Log file (--log-file path/to/out.log): If provided, mirror all per-item log lines to disk. For full-scale runs, consider using --log-errors-only flag to only write error lines to the log file (avoids filling disk with 30M success lines). Without --log-file, logs only go to stdout.

Progress bar (bottom of terminal, schollz/progressbar):

  • Items processed / total items
  • Processing rate (items/sec)
  • ETA
  • Error count

On completion, each program prints a summary line and writes its stats JSON (with started_at, finished_at, duration_seconds, and stage-specific counters).

Testing Strategy

  • Dry-run flags on all Go programs: print what would happen without mutating DB/S3
  • --limit flags on all Go programs: process a small subset quickly
  • Spot-checks: after each stage, manually verify 5-10 random entries
  • Stats files: compare counts between stages to catch data loss
  • 100K dev set: full pipeline at small scale before committing to a 24hr+ full run

Common Pitfalls to Watch For

  • DuckDB CC-Index path: The exact S3 path to parquet files changes per crawl. Check Common Crawl's website for the latest crawl ID and index location.
  • WARC record format: WARC records have a specific envelope format (WARC/1.0 header, blank line, HTTP response). Don't assume the HTTP response starts at byte 0.
  • Relative icon URLs: /favicon.ico is relative to root, but favicon.ico (no leading slash) is relative to the page path. Since we only have root pages (/), both resolve the same. But ../icons/fav.png could be tricky — handle gracefully or skip.
  • ICO files are complex: The ICO container format can embed BMP (with a modified header) or PNG. Many "ICO" files are actually just PNGs renamed to .ico. Check magic bytes, not file extension.
  • SVG rasterization: Go doesn't have great native SVG support. Consider shelling out to rsvg-convert or librsvg, or use a Go library like github.com/nicholasgasior/goresvg. This can be a follow-up if SVG icons are rare.
  • Postgres connection limits: RDS db.t3.medium has max_connections ≈ 80. With 1000 goroutines, we need connection pooling (pgx pool handles this). Set pool max to ~40 connections.
  • S3 eventual consistency: After uploading an icon, a HEAD request might not find it immediately. For dedup checks, handle "not found" gracefully (just upload again — idempotent since key is content hash).
  • CloudFront caching: After deploying new bundles, invalidate /* or set short TTL during development. For production, use long TTLs (bundles are immutable between crawls).

Progress Log

Phase 0 — Completed 2026-05-17

Changes from original plan:

  • Replaced shell scripts (setup.sh, teardown.sh) with Terraform (infra/main.tf). Single file, var.scanning bool switches between scanning and serving phases.
  • SSH key is Terraform-managed (no passphrase, stored in state) rather than manually generated.
  • CloudFront distribution deferred — not created in Phase 0, will add to Terraform when frontend is ready.
  • Added infra/README.md with terse setup steps for future replication.

Lessons learned:

  • Shell scripts with 2>/dev/null || echo "already exists" swallow real errors. Terraform's declarative model avoids this entirely — errors are always surfaced.
  • RDS requires a DB subnet group (2+ subnets in different AZs). The original shell script didn't create one, causing a silent failure. Terraform handles this dependency automatically.
  • Amazon Linux 2023 uses systemd-resolved which manages /etc/resolv.conf. Must disable it before pointing resolv.conf at Unbound. chattr +i doesn't work on the symlink.
  • AWS EC2 key pairs created via API don't support passphrases. Use tls_private_key in Terraform or generate locally with ssh-keygen + import.
  • When an AWS key pair name already exists from a previous run, Terraform may not regenerate it. Use -replace to force recreation of the key + instance together.

Phase 1 (Steps 1.1-1.2) — Completed 2026-05-17

Changes from original plan:

  • Used DuckDB aws extension with CREDENTIAL_CHAIN instead of httpfs anonymous access. The commoncrawl S3 bucket requires authenticated requests.
  • IAM role needed explicit s3:GetObject and s3:ListBucket on arn:aws:s3:::commoncrawl/* — the bucket doesn't allow cross-account access based on bucket policy alone.
  • Used GROUP BY with first(... ORDER BY ...) instead of ROW_NUMBER() window function. More memory-efficient (hash aggregation vs sort), cleaner syntax.
  • DuckDB can glob s3://.../subset=warc/*.parquet directly (300 files) — no need to fetch a file list or download parquet locally.
  • Dropped the url_port IN (80, 443) filter — CC stores standard ports as NULL, not 80/443. Replaced with url_port IS NULL.

Lessons learned:

  • DuckDB URL-encodes = in S3 paths (e.g., crawl%3DCC-MAIN-2026-17) but S3 decodes it correctly. The real issue was always IAM permissions, not path encoding.
  • The commoncrawl S3 bucket requires valid AWS credentials for both GetObject and ListBucket. Anonymous access (unsigned requests) does not work. Any valid IAM identity works as long as their policy allows it.
  • DuckDB's LIMIT can interact unexpectedly with GROUP BY — the optimizer may stop reading input early once it has enough groups. This wasn't our issue (it was the port filter) but worth noting for future queries.
  • CC-Index stores url_port as NULL for standard ports (80/443), not as the integer. Always check actual column values before writing filters.
  • c5.xlarge (8GB) is tight for this query — uses 6.4GB + swap. For the full 30M run, use c5.2xlarge (16GB).
  • Query takes ~692s (11.5 min) for 100K output rows reading all 300 parquet files. Full run without LIMIT will be similar duration but more memory for the hash table.

Phase 2 — Completed 2026-05-17

Changes from original plan:

  • Used AWS SDK S3 GetObject for WARC byte-range requests instead of HTTPS to data.commoncrawl.org. The HTTPS endpoint rate-limits at ~100 concurrent connections (429s). S3 has no such limit.
  • Removed progress bar — it interfered with per-host log lines. Replaced with clean stdout log lines + summary at end. Check DB for mid-run progress.
  • Added process.go and log.go files (plan had 4 files, we have 6 — cleaner separation).
  • Added charset detection + UTF-8 conversion (golang.org/x/net/html/charset + golang.org/x/text/transform) for international titles.
  • Added strings.ToValidUTF8 sanitization as final safety net for titles that still have invalid bytes after charset conversion.
  • Panic recovery per goroutine — logs PANIC: prefix, doesn't mark row as parsed (retryable on next run).
  • DB write errors tracked separately (DB_ERROR: prefix, counted in summary + stats JSON).

Lessons learned:

  • data.commoncrawl.org aggressively rate-limits (403/429) at ~100 concurrent connections. Use S3 API directly for high-concurrency access.
  • Many Chinese/Japanese sites serve GBK or other non-UTF-8 encodings without declaring it in Content-Type or <meta>. charset.DetermineEncoding catches most but not all. strings.ToValidUTF8 as final sanitization prevents Postgres encoding errors.
  • gowarc's HttpHeader() can return nil for malformed records — always nil-check library return values defensively.
  • Increasing concurrency from 100 to 500 didn't improve throughput (~300 hosts/sec either way). The bottleneck is likely Postgres write latency or S3 per-connection bandwidth, not parallelism. Could investigate batch inserts for the full run.
  • Progress bars and per-item log lines don't mix well in terminals. Pick one or write progress to a separate channel (file, stderr).

Phase 3 — Completed 2026-05-18

Changes from original plan:

  • Filtered eligible icons before downloading: skip link_rel icons with declared size >64x64 (apple-touch-icon bloat). Reduced download count from ~302K to ~224K.
  • Channel-based worker pool instead of semaphore pattern — producer goroutine feeds work channel, N workers consume. No starvation between batch claims.
  • Shared http.Transport for connection pooling (marginal benefit since hosts are unique, but reduces GC pressure).
  • No progress bar — same approach as Phase 2 (log lines + summary).
  • User-Agent set to EveryTabBot/1.0 with link to everytab.site/bot for bot identification.

Lessons learned:

  • 70% icon download success rate is expected — most failures are 404s from domains/pages that changed since the crawl. This is acceptable loss.
  • 25% dedup rate — many hosted platforms (Wix, WordPress.com, Squarespace) serve identical default favicons. Content-addressed S3 storage handles this efficiently.
  • data.commoncrawl.org rate-limits HTTPS but S3 does not — same pattern as WARC parsing. Use S3 API for all CC access.
  • Favicon download is I/O bound (network latency to diverse hosts worldwide). Concurrency helps up to a point, then the long tail of slow/dead servers dominates. 351 icons/sec at 200 concurrency.
  • Invalid image detection (magic bytes) catches ~5% of "successful" downloads that are actually HTML error pages served at /favicon.ico.

Phase 4 — Completed 2026-05-18

Changes from original plan:

  • Used github.com/biessek/golang-ico instead of hand-rolled ICO decoder. Handles all bit depths (1/4/8/24/32bpp) correctly. Eliminated ~20 ICO decode errors from the hand-rolled version.
  • SVGs excluded from best-icon selection (can't rasterize without external deps). SVG-only hosts show up with no icon instead of failing at conversion time.
  • Added ≤2x2 pixel exclusion from best-icon selection (tracking pixels / garbage favicons).
  • Icons >128px downscaled to 32x32 during bundle generation. Icons ≤128px (including 80x80) kept as-is — browser CSS handles display scaling.
  • Added panic recovery around icon conversion (the ICO library panics on some malformed files).
  • Added concurrency for S3 icon downloads during bundle generation (was single-threaded, now 50 concurrent).

Lessons learned:

  • Many hosts (28%) have no usable favicon at all — their /favicon.ico returns HTML or 404, and they have no link rel="icon". These appear in bundles title-only.
  • The golang-ico library panics on certain malformed ICO files (index out of bounds). Third-party decoders need panic recovery wrappers.
  • 80x80 icons are overwhelmingly one single default favicon shared by a hosting platform (~4,276 sites share one hash). Content-addressed storage handles this.
  • Bundle sizes are very heterogeneous (39KB to 198KB) due to icon size variance. Average 216KB is well within our target.
  • SVG favicons are ~3.5% of downloaded icons (5,128 out of 156K). Supporting SVG rasterization would recover ~1,077 hosts. Deferred to future improvement.

Phase 5 — Completed 2026-05-18

Changes from original plan:

  • Inline iframe viewer instead of full-screen overlay. Opens between tab rows, pushes content down (75vh).
  • Browser-specific tab styling (Chrome/Firefox/Safari) via userAgent detection — original plan deferred this to v2.
  • Light/dark mode via prefers-color-scheme — original plan just targeted Firefox dark theme.
  • No progress bar in any Go program — per-item log lines + summary at end is the pattern across the project.
  • TOTAL_BUNDLES hardcoded in HTML for now — build script (Step 5.6) still TODO.

Lessons learned:

  • CSS marquee with alternating directions needs care: right-scrolling rows must start at translateX(-50%) and animate to 0, not the reverse. Both directions use the same duplicated DOM structure.
  • width: max-content on the tab row is essential — without it, flex container constrains to viewport width and percentage-based translateX is wrong.
  • Tab hover expansion (removing max-width) causes layout shifts that make neighboring tabs impossible to click. Native tooltip (tab.title) is simpler and has no side effects.
  • Hundreds of animating DOM elements cause frame drops on weaker GPUs. will-change: transform helps but slower animation speeds help more.

Future Improvements

Phase 6 — Completed 2026-05-18

Changes from original plan:

  • Added 4GB swap file to EC2 bootstrap — DuckDB OOM kills without it on c5.xlarge.
  • Added force_destroy = true to icons S3 bucket — terraform teardown fails otherwise when bucket has objects.
  • Pipeline README with full sanity checks between each stage.
  • Deploy script (pipeline/06_frontend/deploy.sh) automates frontend upload + CloudFront invalidation.
  • CloudFront + ACM certificate + S3 bucket policy added to Terraform. Domain setup (Gandi ALIAS record) is one-time manual step.

Lessons learned:

  • Pipeline is reproducible — second clean run produced nearly identical numbers to the first.
  • DuckDB gets S3 503 errors occasionally (rate limiting). Retry works. May need SET threads = 4 or retry logic for the full 30M run.
  • ACM certificate validation is a chicken-and-egg with CloudFront — use aws_acm_certificate_validation resource to make Terraform wait for DNS validation before creating the distribution.
  • deploy.sh sed must match [0-9]* not .* — the greedy match eats the closing </script> tag.
  • Total wall-clock from terraform apply to live site: ~45 minutes (including bootstrap).

Phase 7 (in progress) — 2026-05-19

Changes from original plan:

  • Switched icon storage from S3 to local disk with sharded directories (ab/cd/ef/hash). Eliminates ~$175 in S3 PUT costs at 30M scale.
  • Downloading ALL icons (removed size filter) — full archive for posterity, filter at bundle generation time.
  • EBS bumped from 300GB to 1TB to hold full icon archive.
  • Dropped ORDER BY md5(id::text) from icon claim query — was causing multi-second query times at 3M+ rows, creating 30-second burst/stall cycles. Without ORDER BY, query is instant and workers stay saturated.
  • Batch size 200 → 5000, channel buffer = batch size. Fewer DB round-trips, workers always have work.
  • Pin EC2 AMI in tfvars to prevent Terraform from replacing the instance when Amazon publishes a new AMI.
  • Added CloudFront access logging to S3 (everytab-logs bucket).

Lessons learned:

  • ORDER BY md5(id::text) is O(n) on the unscanned set — fine at 100K, catastrophic at 3M+. The md5 shuffle for "good crawler" behavior is unnecessary when downloading ~2 icons per host across 30M hosts.
  • Icon download throughput improved from 350/s to 408/s just by fixing the claim query bottleneck. The burst/stall pattern on iftop was the key diagnostic.
  • S3 503 "SlowDown" errors are per-bucket-partition, not per-client. At concurrency 100 against commoncrawl bucket, hit 2% error rate. Concurrency 50 eliminated them.
  • Terraform data.aws_ami lookup fetches the latest AMI every apply. Pin it in tfvars to avoid unexpected instance replacement during non-EC2 changes.
  • EBS cost is negligible ($10 for 1TB × 4 days) compared to S3 PUT costs ($175 for 35M PUTs). Local disk is always cheaper for temporary working storage.

3M run results:

  • CC-Index: 3M hosts, ~13min
  • WARC parsing: 3M hosts, ~3hrs at concurrency 50 (reduced from 100 due to S3 503s)
  • Icon download: 6.4M icons, 4h21m at 408/s, 70% success, 53GB downloaded
  • Best icon selection: instant
  • Bundle generation: 2.7M hosts, 1h23m (~540 hosts/sec), 22,429 bundles, 4.7GB total, 215KB avg
  • Frontend deployed with 22,429 bundles to everytab.site

Bundle gen issues identified:

  • OOM at 3M scale — original design loaded all hosts + base64 icons into memory. Fixed: streaming pagination (6000 hosts per page, write bundles incrementally).
  • Fit-and-start pattern: all 6000 conversions must complete before any bundles are written. One slow ICO decode blocks the entire page. S3 uploads are sequential.
  • CPU underutilized — PNG encoding is CPU-bound but the batch-convert-then-write pattern means cores idle during DB fetches and S3 uploads.
  • At current rate (540 hosts/sec), 30M hosts would take ~15hrs for bundle gen alone. Pipeline redesign needed.

Additional lessons learned:

  • random_order column with DEFAULT random() on hosts table enables shuffled bundle generation without expensive ORDER BY. Indexed for fast pagination.
  • Bundle sizes are remarkably consistent across scales (215-217KB avg at 100K, 3M). The entries-per-bundle = 120 parameter is well-tuned.
  • Convert errors at 3M: 4,124 (0.2%) — mostly SVGs and malformed ICOs. Acceptable loss.
  • 2.7M hosts bundled (not 3M) because ~300K had no title (excluded from bundles).

Phase 7: 3M Scale Test [COMPLETED]

Validated disk-based icon storage, performance tuning, and full pipeline at 3M scale.

Final 3M pipeline results:

Stage Duration Result
CC-Index query ~13min 3M hosts
WARC parsing ~3hrs (concurrency 50) 2.8M titles, 6M icons
Icon download 4h21m (408/s) 4.5M completed, 53GB, 70% success
Best icon selection instant 2M hosts with icons
Bundle generation 1h23m (540 hosts/sec) 22,429 bundles, 4.7GB
Frontend deploy seconds Live at everytab.site
Total ~9 hours

Key changes during this phase:

  • Icons stored on local disk (sharded ab/cd/ef/hash), not S3 — saves ~$175 in PUT costs
  • Removed icon size filter — downloads ALL icons for archival, filters at bundle gen time
  • Dropped ORDER BY md5(id::text) from icon claim query — was causing 30-second burst/stall cycles at 3M scale
  • Icon download batch size 200 → 5000, channel buffer = batch size
  • Bundle gen rewritten to stream: paginated DB reads, incremental bundle writes (fixed OOM at 3M)
  • random_order column on hosts table for shuffled bundles
  • EBS volume sized at 1TB for full icon archive
  • Added COSTS.md with monthly cost breakdown (~$42/month ongoing)

Icon selection strategy (TODO: decide before Phase 8)

Now that we download ALL icons (including large 192x192, 512x512, etc.), the best-icon selection for the live site needs thought. Current SQL picks by:

  1. Standard square sizes ≤64 → other squares ≤64 → non-square ≤64 → everything else
  2. Prefer PNG/GIF/ICO over WebP, exclude SVG
  3. Tiebreak by smaller file size

This works but questions remain:

  • Should we prefer favicon.ico over link_rel when quality is similar? (favicon.ico is the universal fallback, link_rel might be higher quality but less reliable)
  • Should we downscale >128px to 32x32 in bundle gen, or let the browser handle it? (affects bundle size vs quality)
  • Should we have different strategies for different icon sources? (e.g., always use link_rel PNG 32x32 if available, fall back to favicon.ico)
  • At 30M scale, how much do large icons bloat total bundle size? Need data from the 3M run to decide.

Phase 7.2: Code Review + Performance Fixes [COMPLETED]

Adversarial code review followed by performance improvements, validated with a 300K run.

Code review findings and fixes

  • float32 pagination bugrandom_order REAL on hosts table caused ~1.5% data loss at 30M scale due to float32 collisions in keyset pagination. Fixed: DOUBLE PRECISION + float64 in Go.
  • Protocol missing from bundles — bundle JSON had host but no protocol. HTTP-only sites (23%) loaded as https:// and broke. Fixed: replaced host field with url (full URL built on Go side).
  • Non-atomic bundle deployment — bundle gen deleted all S3 bundles before writing new ones. Crash mid-write = broken live site. Fixed: overwrite in-place, deploy.sh cleans up stale bundles after cache invalidation.
  • ARCHITECTURE.md stale — still described S3 icon storage, old claim query, old bundle format. Updated throughout to match current code.
  • Dead go.mod dependencies — progressbar and transitive deps removed. Direct vs indirect annotations fixed via go mod tidy.
  • Shadowed builtins — custom min()/max() functions removed (Go 1.21+ builtins).
  • BMP decoder missing — standalone BMP favicons passed download but failed in bundle gen. Added golang.org/x/image/bmp import.
  • Frontend memory leakloadedIcons array grew unboundedly. Capped at 100 entries.
  • Iframe stats inflated — error hosts counted as "iframe blocked" (zero value of bool). Fixed to only count successful parses.
  • CSP check incomplete — only checked first Content-Security-Policy header. Fixed to check all headers via headers.Values().
  • DNS error classification — direct type assertion err.(*net.DNSError) never matched wrapped errors. Fixed with errors.As().
  • Icon download host hammering — adjacent same-host icons in batches caused simultaneous requests. Fixed: Fisher-Yates shuffle of each batch before feeding to workers.
  • Max icons per host — capped at 50 link_rel icons per host in HTML parser to prevent adversarial pages from bloating the DB.
  • downloaded_at column — added to icons table for data freshness tracking.

Pipeline performance redesign

WARC parser — three-stage pipeline:

[DB fetcher] → hostCh → [500 workers] → resultCh → [DB writer with pgx.Batch]
  • Channel-based worker pool (500 goroutines, up from 100 semaphore)
  • S3 retry with 6 attempts (AWS SDK retry.AddWithMaxAttempts)
  • Batched DB writes via pgx.Batch (100 results = ~400 queries per round-trip)
  • Result: 566 hosts/sec (1.6x improvement over 352/sec)

Bundle gen — four-stage pipeline:

[DB fetcher] → hostCh → [20 converters] → entryCh → [assembler] → uploadCh → [10 uploaders]
  • Converters default 20 (CPU-bound, ~5x core count on c5.xlarge)
  • Separate upload workers for S3 PUT parallelism
  • Result: 2,377 hosts/sec (4.4x improvement over 540/sec)

Icon download — in-memory batch shuffle added, concurrency bumped to 1000.

CC-Index query — downloads parquet files locally first (aws s3 sync), then queries with DuckDB. Eliminates S3 503 rate-limit failures.

Best icon selection — new priority: target 32x32 for Retina display. Pick smallest icon ≥32px, fall back to largest <32px. No more "standard sizes" tiers.

Stats improvements

  • WARC parser: added no_title counter, fixed icons_found to include favicon.ico
  • Best icon selection: now writes stats/04_best_icon.json
  • Bundle gen: added bundled_with_icon / bundled_no_icon counters (distinguishes "never had icon" from "convert error")

300K validation run results

Stage Duration Rate
CC-Index query 83s
WARC parsing 8m50s 566 hosts/sec
Icon download 34m45s 439 icons/sec
Best icon selection instant
Bundle generation 1m59s 2,377 hosts/sec
Frontend deploy seconds
Total ~47 min

Loss funnel:

300,000 hosts from CC-Index
 → 282,854 with titles (94.3%)
 → 213,656 bundled with icon (75.5% of titled)
 →  69,198 bundled without icon (24.5%)
     → 68,793 never had an icon
     →    405 icon convert errors
 → 2,358 bundles, 603MB total

Phase 8: 30M Full Run [IN PROGRESS]

Full internet scan. Upgraded to c5.2xlarge + db.m5.large after 300K run revealed bottlenecks.

Infrastructure changes (mid-run)

  • EC2: c5.xlarge → c5.2xlarge (8 vCPUs, 16GB RAM). WARC parsing was CPU-bound at 100% on 4 cores. Icon download was memory-limited at 2500 concurrent connections. Bundle gen CPU-bound. The extra cores and RAM benefit all three stages.
  • RDS: db.t3.medium → db.m5.large (non-burstable, 2 vCPUs, 8GB RAM). The t3 burstable instance was getting CPU-throttled under sustained write load, causing the WARC parser's DB writer to stall and back-pressure workers.
  • Swap: 4GB → 8GB (2x physical RAM as safety margin).
  • DuckDB temp_directory set to EBS (~/duckdb_temp) instead of defaulting to tmpfs. DuckDB's managed spill-to-disk is far more efficient than OS swap — sequential large reads vs random 4KB page faults.

Tuning changes

  • WARC parser concurrency: 500 → 500 (kept, but now on 8 cores instead of 4 — more actual throughput)
  • WARC parser write batch: 100 → 500 (~2000 queries per DB round-trip). Fewer flushes = less back-pressure on workers.
  • WARC parser startup: removed slow COUNT(*) WHERE parsed = FALSE query (scans 26M-row index, takes minutes). Not needed — fetcher discovers empty results naturally.
  • WARC parser channel buffers: hostCh 5K → 20K, resultCh 500 → 1K. Prevents micro-stalls between DB fetcher queries.
  • Icon download concurrency: 1000 → 5000 (16GB RAM supports the connection overhead).
  • Icon download channel buffer: 5K → 20K.
  • Bundle gen concurrency: 20 → 40 (8 cores × 5).
  • Bundle gen channel buffers: 1.2K → 6K.
  • Debug logging added to WARC parser fetcher and writer to diagnose stalls.

Run parameters

  • --limit 0 for CC-Index query
  • CC-Index result: 26,703,146 hosts (20.6M https, 6.1M http)
  • Run in tmux, monitor with psql queries from another session
  • Expected disk: ~650GB for all icons
  • Expected time: ~2 days (with upgraded infra, down from ~3 days estimate)
  • Cost: ~$65 (EC2 c5.2xlarge + RDS m5.large + 1TB EBS for 2-3 days)
  • After completion: deploy frontend, verify live site, backup icons + DB to homelab via rsync
  • Stuck icon recovery (if icon download crashes): UPDATE icons SET scan_state = 'unscanned' WHERE scan_state = 'in_progress';

Lessons learned (during 30M run)

  • Burstable DB instances are unsuitable for pipeline workloads. The t3.medium throttled under sustained writes, causing stalls that propagated through the entire WARC parser pipeline via channel back-pressure. Non-burstable m5 instances provide consistent performance.
  • WARC parsing is CPU-bound, not just I/O-bound. At 500 goroutines on 4 cores, CPU was at 100% — TLS handshakes + gzip decompression + HTML parsing add up. More cores directly increases throughput.
  • Channel buffer sizing matters. Small buffers (5K) caused micro-stalls every time the DB fetcher ran a query. 20K buffers give the fetcher enough runway to query without starving workers.
  • DuckDB temp_directory is critical at scale. Without it, DuckDB spills to tmpfs (RAM-backed), which then swaps to disk via the OS — double indirection. Pointing temp_directory at EBS lets DuckDB manage spill efficiently with large sequential I/O.
  • COUNT(*) on large partial indexes is expensive. The startup query SELECT COUNT(*) FROM hosts WHERE parsed = FALSE on 26M rows took minutes. Unnecessary — just start processing and discover completion naturally.
  • Autovacuum competes with heavy writes for disk I/O. Millions of UPDATEs create dead row versions. Autovacuum kicks in to clean them, saturating disk I/O and stalling writers. Fix: disable autovacuum on hosts/icons during the pipeline run, re-enable + manual VACUUM ANALYZE at the end. Now automated in the WARC parser code.
  • RDS storage must be sized for the full run. 20GB (set during 100K dev) filled up at 16GB during the 30M run. Icons table at 80M rows with indexes needs ~25-30GB. Default bumped to 50GB. RDS storage can only be increased, never decreased.
  • Multiple DB writer goroutines help throughput. Single writer couldn't drain resultCh fast enough — 8 writers with independent buffers keep up to 8 batches in flight to RDS simultaneously.
  • RDS storage optimization causes temporary I/O degradation. After expanding storage, RDS runs background optimization that competes with writes. Can last up to an hour. Plan storage right from the start to avoid mid-run resizes.
  • gp3 IOPS baseline (3000) is a hard limit at scale. 8 DB writers with batch size 500 exhausted EBS I/O burst credits (EBSIOBalance% hit 0%), causing 38ms write latency (normal <5ms) and pipeline stalls. Fix: reduce to 3 writers with batch size 1000 — fewer, larger flushes stay under 3000 IOPS. Custom provisioned IOPS on gp3 requires 400GB+ storage (not worth it for a temp DB).
  • Consider running Postgres locally on EC2 for future runs. RDS gp3 IOPS (3000 baseline) is the main bottleneck for WARC parsing writes. Running Postgres directly on the EC2 instance's 1TB EBS volume eliminates the network hop to RDS and the separate IOPS budget. Also saves the RDS cost ($12-15/run). Tradeoff: must install and configure Postgres yourself (or add to ec2-userdata.sh).
  • Reconsider c5.xlarge for future runs. Upgraded to c5.2xlarge assuming CPU was the bottleneck, but RDS IOPS turned out to be the real constraint for WARC parsing, and icon download is internet-bound. If the extra cores don't meaningfully improve throughput (check CPU utilization during the full run), c5.xlarge at half the cost ($0.17/hr vs $0.34/hr) may be sufficient. The only stage that clearly benefits from 8 cores is bundle gen (~2hrs saved).

Phase 9: Frontend Polish

Before public launch:

  • Cross-browser tab styling — use actual browser screenshots as reference for Chrome/Firefox/Safari
  • Mobile responsive layout
  • Performance — IntersectionObserver to pause off-screen marquee rows, reduce DOM count
  • Stats page — render pipeline stats (host count, icon coverage, crawl date)
  • Test across browsers and devices

Lessons learned

  • Firefox CSS transform animation jitter (Bug 739176). Firefox deliberately snaps transformed elements to device pixels as a performance optimization. At slow marquee speeds (~21 px/sec, <1px/frame at 60fps), this creates visible icon flickering as pixels snap between integer boundaries each frame. Chrome and Safari do smooth sub-pixel GPU interpolation, so they're unaffected.
  • Fix: filter: blur(0px) on Firefox. A no-op CSS filter forces Firefox through a different compositing path that enables sub-pixel rendering during transform animations. Applied via .browser-firefox .tab-row { filter: blur(0px); } using the existing browser detection class on <body>.
  • image-rendering: pixelated conflicts with sub-pixel transforms. Once the filter hack enables sub-pixel rendering, pixelated (nearest-neighbor sampling) causes icons to flicker at sub-pixel positions. Fix: .browser-firefox .tab-icon { image-rendering: auto; } — bilinear filtering interpolates smoothly. Icons are very slightly softer on Firefox but don't flicker. Chrome/Safari keep pixelated.
  • Integer-pixel rAF animation is not a good alternative. Attempted requestAnimationFrame with Math.round()/Math.floor() to control pixel snapping — eliminates flicker but introduces visible stepping at slow speeds. Text jumps are more distracting than the original jitter. The filter hack is strictly better.
  • Web Animations API flickers at loop boundary on Firefox. Even with pixel-precise endpoints (not percentage), Firefox's compositor flickers when an iterations: Infinity animation restarts. Fix: use requestAnimationFrame with continuous sub-pixel positions and modular wrapping (if (pos >= halfWidth) pos -= halfWidth) — no iteration boundary means no flicker. Combined with filter: blur(0px) for sub-pixel smoothness. Chrome/Safari use the Web Animations API without issue.
  • Marquee speed should be px/sec, not fixed duration. Original code used 90-150s fixed duration, but wider screens have wider rows, so the same duration = faster movement. Fix: target 15-25 px/sec and calculate duration from measured row width after DOM insertion.

Phase 10: Parallelization (if needed)

Only pursue if single-machine time (~3-4 days) is unacceptable.

Approach: N machines for WARC parsing + icon download, consolidate, bundle on one machine.

  • WARC parsing: already supports fleet via FOR UPDATE SKIP LOCKED — run same binary on N machines pointing at same RDS
  • Icon download: same SKIP LOCKED pattern — each machine downloads to local disk
  • Consolidation: rsync icons from N machines to one machine
  • Bundle gen: runs on the consolidation machine
  • Alternative: partition by id % N so each machine owns a shard end-to-end (WARC → icons → bundles), no consolidation needed. Bundle numbering uses non-overlapping ranges.
  • Infrastructure: Terraform variable ec2_count = N, each instance gets same IAM/security group
  • Expected speedup: ~linear (4 machines ≈ 4x faster ≈ 1 day for full scan)

Simpler alternative to parallelization:

  • Download only /favicon.ico (skip link_rel icons) — cuts icon count from ~67M to ~30M, roughly halving the longest stage. Minimal quality loss since most best-selected icons are favicon.ico anyway.
  • Use c5n.xlarge (25 Gbps NIC) instead of c5.xlarge (10 Gbps) — check with iftop if network is actually the bottleneck first.

Phase 11: Automation

Monthly pipeline triggered by new Common Crawl release.

  • Shell script wrapping terraform + pipeline stages + deploy
  • Detect new crawl: cron job checking https://index.commoncrawl.org/collinfo.json weekly
  • Compare latest crawl ID against last processed (stored in a file or S3 tag)
  • On new crawl: terraform up → run pipeline → deploy → terraform down
  • Notifications: email/webhook on success or failure
  • Later: move to Forgejo CI with manual trigger button + scheduled trigger

Before Next Run

Infrastructure

  • Run Postgres locally on EC2 instead of RDS. Eliminates the IOPS bottleneck that dominated this run + saves $12-15/run. Add Postgres install to ec2-userdata.sh.
  • Evaluate c5.xlarge vs c5.2xlarge based on CPU utilization data from this run. May not need the extra cores.
  • Pre-size EBS/storage correctly from the start — no mid-run resizes (RDS storage optimization causes I/O degradation).

Pipeline code

  • CC-Index: streaming dedup (INSERT ON CONFLICT) to eliminate DuckDB memory pressure and swap thrashing.
  • WARC parser: tune writer count + batch size based on observed IOPS from this run. With local Postgres, may be able to go back to 8 writers.
  • WARC parser: remove or gate debug log lines ([fetcher], [writer]) behind a --verbose flag.
  • Icon download: measure throughput at 5000 concurrency, decide if worth going higher.
  • Bundle gen: bilinear downscaling instead of nearest-neighbor for >128px icons.
  • SVG rasterization via rsvg-convert (recovers ~3.5% of icons).

Measurement / validation

  • Measure final DB size (tables + indexes) with pg_total_relation_size to right-size storage.
  • Check CPU utilization across all stages to confirm actual bottlenecks.
  • Compare icon download rate at 5000 vs 1000 concurrency.
  • Verify loss funnel numbers add up end-to-end at 30M scale.

Future Improvements (Non-Blocking)

Pipeline

  • CC-Index query: streaming dedup — current GROUP BY builds a ~30M-row hash table in memory, causing severe swap thrashing on c5.xlarge (8GB). Options: (1) use INSERT ... ON CONFLICT (hostname) DO UPDATE to stream rows into Postgres and let the UNIQUE constraint dedup, eliminating the hash table entirely; (2) process parquet files in smaller batches, dedup per-batch in DuckDB, final dedup in Postgres; (3) just use c5.2xlarge (16GB) to fit the hash table in RAM. Current workaround: SET temp_directory to let DuckDB spill to EBS instead of OS swap.
  • Encoding: remaining garbled titles — more aggressive charset detection heuristics
  • Icon download: retry transient failures — single retry for DNS/timeout errors
  • Bundle gen: SVG rasterization — shell out to rsvg-convert for SVG-only hosts (~3.5% of icons)
  • Bundle gen: bilinear downscaling — better quality than nearest-neighbor for >128px icons

Frontend

  • Mobile layout — responsive tab sizing, touch-friendly interaction
  • Stats page / Sankey diagram — pipeline loss funnel rendered on the site

Additional Metadata

  • http_status INT on icons — structured HTTP status code (currently only stored as error text). Enables analysis like "404 (site moved) vs 403 (bot blocked) vs 500 (server error)".
  • response_time_ms INT on icons — server response latency. Useful for tuning timeouts, identifying slow hosts, health signal.
  • parsed_at TIMESTAMPTZ on hosts — when the WARC was parsed. Currently only a parsed boolean.
  • created_at TIMESTAMPTZ DEFAULT now() on hosts — when the host entered the pipeline.