### Tab Rendering

- Rows of tabs fill the viewport, styled to match the visitor's browser (Chrome, Firefox, Safari — detected via `navigator.userAgent`)
- Each row has a bidirectional marquee animation at varying speeds (90-150s per cycle), with random stagger to avoid synchronization
- Tabs duplicated in DOM for seamless marquee loop (`translateX(-50%)`)
- Each tab shows: favicon (rendered via `<img src="data:image/png;base64,...">`) + truncated title
- No-icon tabs: just title text, no icon
- Light mode default, auto-switches to dark mode via `prefers-color-scheme`
- Hover shows full title as native tooltip

### Interaction

- **Click tab (iframe_ok=true):** Opens an inline iframe viewer between tab rows (75vh height, pushes content down)
- **Click tab (iframe_ok=false):** Opens site in a new tab (with `↗` external-link indicator on the tab)
- **Close viewer:** X button or Escape key. Only one viewer open at a time.
- **Scroll down:** When approaching the bottom, fetch next random bundle and render more rows

### Randomization

- Seed: `Date.now()` (milliseconds UTC) — every visitor at a different moment sees different tabs
- PRNG: seeded random number generator (e.g., mulberry32 or xoshiro) for deterministic sequence from seed
- Generate random bundle indices in range `[0, TOTAL_BUNDLES)`
- Track fetched bundle IDs in a `Set` to avoid loading duplicates on continued scroll

### Future Enhancements

- Mobile-optimized layout
- "Search for a site" feature
- Stats page (how many sites, coverage, etc.)
- Performance: IntersectionObserver to pause off-screen marquee rows

## Statistics & Metadata

Each pipeline stage emits a JSON stats file:

```
stats/
  01_cc_index.json
  02_warc_parse.json
  03_icon_download.json
  04_best_icon.json
  05_bundle_gen.json
```

After bundle generation, these are merged into a single `stats.json` uploaded to `everytab-site`:

```json
{
  "crawl_id": "CC-MAIN-2026-05",
  "generated_at": "2026-05-17T12:00:00Z",
  "pipeline": {
    "cc_index": {
      "started_at": "2026-05-17T08:00:00Z",
      "finished_at": "2026-05-17T08:42:00Z",
      "duration_seconds": 2520,
      "total_domains": 31245678,
      "https": 28901234,
      "http_only": 2344444,
      "duplicates_removed": 1456789
    },
    "warc_parse": {
      "started_at": "2026-05-17T08:45:00Z",
      "finished_at": "2026-05-17T12:15:00Z",
      "duration_seconds": 12600,
      "processed": 31245678,
      "titles_extracted": 29876543,
      "icons_found": 45678901,
      "iframe_restricted": 12345678,
      "parse_failures": 234567
    },
    "icon_download": {
      "started_at": "2026-05-17T12:20:00Z",
      "finished_at": "2026-05-18T18:30:00Z",
      "duration_seconds": 108600,
      "attempted": 45678901,
      "completed": 38901234,
      "failed_dns": 2345678,
      "failed_timeout": 1234567,
      "failed_http_error": 1567890,
      "failed_invalid_image": 890123,
      "failed_too_large": 12345,
      "unique_icons_stored": 34567890,
      "dedup_hits": 4333344
    },
    "best_icon": {
      "started_at": "2026-05-18T18:35:00Z",
      "finished_at": "2026-05-18T18:40:00Z",
      "duration_seconds": 300,
      "hosts_with_icon": 27654321,
      "hosts_without_icon": 3591357
    },
    "bundles": {
      "started_at": "2026-05-18T18:45:00Z",
      "finished_at": "2026-05-18T20:10:00Z",
      "duration_seconds": 5100,
      "total_bundles": 52341,
      "total_hosts_included": 29876543,
      "hosts_with_icon": 27654321,
      "hosts_without_icon": 2222222,
      "excluded_no_title": 1369135,
      "avg_bundle_size_bytes": 245000
    }
  }
}
```

This is served publicly at `/stats.json` on the live site — interesting metadata for visitors and useful for monitoring pipeline health across crawls.
## Cost Estimate

### Scanning Phase (One-Time per Crawl)

| Item | Estimate |
|------|----------|
| EC2 c5.xlarge (~3-4 days) | $12-16 |
| EBS 1TB gp3 (~4 days) | $10 |
| RDS db.t3.medium (~4 days) | $4-6 |
| Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) |
| Data transfer (icon downloads from internet, inbound) | $0 (inbound free) |
| Data transfer (backup to homelab, outbound) | $5-45 (depends on icon archive size) |
| **Total** | **~$31-77** |

### Hosting Phase (Monthly Steady-State)

| Item | Estimate |
|------|----------|
| S3 everytab-site storage (~10-15GB of bundles) | $0.35 |
| CloudFront (free tier: 1TB/month transfer, 10M requests/month) | $0 |
| S3 origin requests via CloudFront (heavily cached) | $1-3 |
| **Total** | **~$2-4/month** |

Note: Bundle storage estimate revised down. With ~50K bundles at ~250KB each = ~12.5GB, well under previous estimate since we're targeting viewport-fill (100-150 tabs) not 1MB bundles.

If the site gets significant traffic beyond CloudFront free tier, costs scale with usage — but that's a success problem.

## Scaling Strategy

### Development Phase (100K domains)

- Cap CC-Index query to 100K rows
- Full pipeline runs in minutes
- Validates end-to-end correctness
- Frontend development and tab-density tuning

### Full Scan (30M domains)

- Single EC2 instance, high concurrency
- CC-Index query: <1hr (httpfs) or ~2hrs (download + local query)
- WARC parsing: 2-6hrs
- Icon download: 12-48hrs (the long pole)
- Bundle generation: 1-2hrs
- Total: ~1-2 days

### Fleet Scaling (if single instance is too slow)

- Spin up N identical EC2 instances running the icon downloader
- All connect to the same RDS instance
- Work claiming via `FOR UPDATE SKIP LOCKED` — no double work, no coordinator
- Linear throughput scaling: 4 instances ≈ 4x download speed
- Only the icon download stage benefits from fleet (other stages are fast enough solo)

## Key Design Decisions

1. **Static-only hosting** — No servers for the live site. Everything pre-built. Minimal attack surface, minimal cost.
2. **Inline icons in bundles** — One fetch gives you 100+ tabs to render. No per-icon requests.
3. **Base64 + Brotli** — Base64 for browser-native decoding (`atob()`). Brotli compression at the CDN layer reduces transfer size by ~25-30% for free.
4. **Unbound as system resolver** — Transparent to application code. Standard Go HTTP. No custom networking.
5. **SHA-256 content-addressed icon storage** — Natural dedup on local disk. Same favicon stored once even if referenced by multiple hosts.
6. **Permissive download, selective bundling** — Download ALL favicon formats and sizes during scanning. Convert to optimized PNG only during bundle generation. Decouples "capture as much as possible" from "serve the best version."
7. **Partial index for work claiming** — Indexes only unscanned rows. Shrinks as work progresses. Minimal write amplification.
8. **Local disk for icons, S3 for site** — Icons stored on EBS during scanning (avoids ~$175 in S3 PUT costs at 30M scale). Only the static site lives in S3 behind CloudFront.
9. **Per-millisecond random seed** — Every visitor sees a unique arrangement. No shared state, no server needed for randomization.
10. **Viewport-sized bundles** — ~100-150 tabs per bundle, tuned to fill a screen. Faster loads, smaller memory footprint than 1MB bundles.
11. **Include no-icon hosts** — Any host with a title is included. Firefox-style rendering (title only) for hosts without favicons.
12. **Denormalized best_icon_s3_key in hosts** — Stores the SHA-256 hash of the chosen icon. Avoids joins during bundle generation. Written once during icon selection, read once during bundling.

# EveryTab Architecture

## System Overview

EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:

1. **Scanning Phase** — A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
2. **Hosting Phase** — A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.

The scanning phase runs monthly (triggered by new Common Crawl releases), produces a static site, and then its infrastructure is torn down after backing up data to the homelab. The hosting phase runs indefinitely at minimal cost.

## Workflow Diagram

```mermaid
flowchart TD
    subgraph EC2["Scanning Phase (EC2 instance)"]
        A["Stage 1: Query CC-Index via DuckDB"]
        B["Stage 2: Parse WARCs - Go"]
        C["Stage 3: Download Icons - Go"]
        D["Stage 4: Select Best Icons"]
        E["Stage 5: Generate Bundles - Go"]
        F["Stage 6: Deploy Frontend"]
        UB["Unbound - Local recursive resolver"]
        DISK["Local disk - Sharded icon archive"]
        A --> B --> C --> D --> E --> F
        UB -.-> C
        C --> DISK
        DISK --> E
    end
    subgraph ExtData["External Data"]
        CC["Common Crawl S3 - Parquet Index + WARCs"]
    end
    subgraph AWS["AWS Services"]
        RDS[("RDS Postgres - hosts + icons tables")]
        S3S["S3: everytab-site - tabs/*.json + index.html"]
        CF["CloudFront CDN"]
    end
    subgraph Post["Post-Scan"]
        BAK["Backup to Homelab - RDS dump + icons rsync"]
        TEAR["Teardown - Delete RDS, EC2"]
    end
    CC --> A
    CC --> B
    A --> RDS
    B --> RDS
    C --> RDS
    D --> RDS
    E --> S3S
    F --> S3S
    S3S --> CF
    F --> BAK
    BAK --> TEAR
```

**Key point:** DuckDB, Go programs, and Unbound all run on the same EC2 instance. The pipeline is sequential — one stage completes before the next begins.

## AWS Infrastructure

All resources in **us-east-1**.

| Resource | Purpose | Lifecycle |
|----------|---------|-----------|
| EC2 (c5.xlarge) + 1TB EBS | Run all pipeline stages, store icon archive | Scanning only |
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup to homelab, then delete) |
| S3 `everytab-site` | Static site: index.html, site.js, tabs/*.json | Permanent |
| CloudFront | CDN for static site (Brotli compression enabled) | Permanent |
| S3 `everytab-logs` | CloudFront access logs | Permanent |
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only (runs on EC2) |

### Icon Storage

Icons are stored on local disk during scanning, not S3. The EBS volume holds the full icon archive in a sharded directory structure (`ab/cd/ef/{sha256}`). This avoids ~$175 in S3 PUT costs at 30M scale. After scanning completes, icons are backed up to the homelab via rsync.

### Steady-State (Hosting Only)

- S3 `everytab-site` — index.html + site.js + ~250K JSON bundles
- CloudFront distribution — Brotli-compressed delivery, caching

## Data Model

### `hosts` table

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PRIMARY KEY | Internal ID |
| hostname | TEXT NOT NULL UNIQUE | e.g., `example.com` |
| protocol | TEXT NOT NULL | `https` or `http` (prefer https) |
| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., `CC-MAIN-2026-05`) |
| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
| warc_record_length | INT NOT NULL | Length of WARC record |
| html_title | TEXT | Extracted from `<title>` tag |
| iframe_allowed | BOOLEAN | True if site allows framing |
| best_icon_s3_key | TEXT | SHA-256 hash of the chosen icon file (denormalized for fast bundle gen) |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |
| random_order | DOUBLE PRECISION DEFAULT random() | Random value for shuffled bundle generation pagination |

### `icons` table

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PRIMARY KEY | Internal ID |
| host_id | INT REFERENCES hosts(id) | FK to parent host |
| url | TEXT NOT NULL | Full URL to the icon |
| source | TEXT NOT NULL | `favicon_ico` or `link_rel` |
| rel_type | TEXT | MIME type from HTML attribute (if specified) |
| rel_sizes | TEXT | Sizes attribute from HTML (if specified) |
| content_type | TEXT | Actual MIME type after download |
| width | INT | Best usable pixel width (for ICO: largest standard size ≤64; for SVG: NULL) |
| height | INT | Best usable pixel height (for ICO: largest standard size ≤64; for SVG: NULL) |
| file_size | INT | Size in bytes |
| s3_key | TEXT | SHA-256 hash of content (used as local file path, legacy column name) |
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
| error | TEXT | Error message if failed |

**Indexes:**

- `CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned'` — partial index for work claiming. Only indexes unscanned rows; shrinks as work completes. Minimal write overhead since index only updates on transition OUT of 'unscanned'.
- `idx_icons_host_id` on (host_id) — for best-icon selection query

**Content-Addressed Storage:** SHA-256 hash of the downloaded icon content, used as the local file path (`ab/cd/ef/{full_hash}`). This gives free dedup — if two sites serve the exact same favicon bytes, we store it once. Before writing, check if the file exists; if so, skip the write but still record the hash in the icons table.

### Bundle JSON format (`tabs/{n}.json`)

```json
{
  "entries": [
    {
      "url": "https://example.com",
      "title": "Example Domain",
      "icon": "iVBORw0KGgo...",
      "icon_w": 32,
      "icon_h": 32,
      "iframe_ok": true
    },
    {
      "url": "http://no-favicon-site.org",
      "title": "A Site Without Favicon",
      "icon": "",
      "iframe_ok": false
    }
  ]
}
```

Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with `"icon": ""`) as long as they have a title.

CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data.

Bundle size is parameterized (`ENTRIES_PER_BUNDLE`, default 120). Tuned to fill a viewport plus scroll buffer. Average bundle size ~215KB uncompressed, significantly smaller after Brotli.

## Pipeline Stages

The pipeline is a series of manually-run scripts executed in order on the single EC2 instance. Each stage is idempotent and resumable.

### Stage 1: CC-Index Query

**Tool:** DuckDB with `aws` extension (credential chain) to read parquet directly from S3

**Input:** Common Crawl columnar index (parquet files on `s3://commoncrawl/cc-index/...`)

**Query logic:**

```sql
WHERE url_path = '/'
  AND content_mime_type = 'text/html'
  AND fetch_status = 200
  AND url_query IS NULL
  AND url_protocol IN ('http', 'https')
  AND url_port IS NULL
```

**Deduplication:** Per hostname, prefer `https` over `http`. Result is one row per unique hostname.

**Output:** Populates `hosts` table in RDS (~30M rows for a full crawl).

**Cost:** $0 — Common Crawl is part of the AWS Open Data Registry. S3 GET requests and data transfer within us-east-1 are free.

**Stats emitted:** Total domains found, https vs http breakdown, duplicates removed.

### Stage 2: WARC Parsing

**Tool:** Custom Go program, highly concurrent

**Input:** `hosts` table rows where `parsed = FALSE`

**Process:**

1. Read batches of unparsed rows (cursor-based pagination by ID)
2. For each row, make a byte-range S3 GetObject request to the `commoncrawl` bucket:
   - `Range: bytes={offset}-{offset+length-1}`
   - Uses AWS SDK (not `data.commoncrawl.org` HTTPS endpoint, which rate-limits at ~100 concurrent connections)
3. Parse the WARC record to extract the HTTP response
4. From HTTP response headers: check for `X-Frame-Options` and `Content-Security-Policy` frame-ancestors
5. Parse HTML defensively (lenient parser, handle malformed HTML):
   - Extract `<title>` tag content
   - Extract ALL `<link rel="icon">` / `<link rel="shortcut icon">` entries with their href, type, and sizes attributes
6. Insert a `/favicon.ico` entry into `icons` for every host (protocol://hostname/favicon.ico)
7. Insert all discovered `link rel="icon"` entries into `icons` (any format: ICO, PNG, GIF, SVG, WebP, JPEG)
8. Update `hosts` row: html_title, iframe_allowed, parsed = TRUE

**Concurrency:** High — thousands of goroutines with a semaphore/pool. CC's S3 handles massive throughput.

**Error handling:** Malformed HTML → still extract what we can (partial title, partial icons). WARC fetch failure → log and skip (mark parsed = TRUE with NULL title to avoid retry loops). All errors logged with hostname for investigation.

**Icon URL handling:** Relative URLs resolved against `{protocol}://{hostname}/`. Absolute URLs kept as-is. Data URIs ignored.

**No scan_state needed:** CC's S3 is highly reliable. The `parsed` boolean is sufficient. If the process crashes mid-batch, re-run picks up where it left off (unparsed rows).

**Cost:** $0 (same Open Data program).

**Stats emitted:** Rows processed, titles extracted, icons found (by source: favicon_ico vs link_rel), icon format distribution, iframe restrictions found, parse failures, rows with no title.

### Stage 3: Icon Download

**Tool:** Custom Go program, highly concurrent

**Prerequisite:** Unbound running as system resolver on the EC2 instance.

**Input:** ALL `icons` table rows where `scan_state = 'unscanned'` — no size filter. Every `favicon_ico` and `link_rel` icon is downloaded regardless of declared size. The full archive is kept on disk; filtering happens later at best-icon selection and bundle generation.

**Process:**

1. Producer goroutine claims batches via `FOR UPDATE SKIP LOCKED`:

   ```sql
   UPDATE icons SET scan_state = 'in_progress'
   WHERE id IN (
       SELECT id FROM icons
       WHERE scan_state = 'unscanned'
       LIMIT 5000
       FOR UPDATE SKIP LOCKED
   )
   RETURNING id, url;
   ```

   Icons are fed into a buffered channel. N worker goroutines consume from the channel, so workers never starve between batch claims.
2. For each icon URL:
   - Make HTTP(S) GET request (standard Go HTTP client — DNS transparently goes through Unbound)
   - Shared `http.Transport` for connection pooling and TLS session reuse
   - Enforce timeouts: 5s connect, 10s total
   - Enforce max download size: 512KB (generous for icons, but prevents abuse)
   - On success:
     - Validate magic bytes (is this actually an image?)
     - Decode to get dimensions:
       - PNG/GIF/WebP/JPEG/BMP: read image headers for width/height
       - ICO: parse ICO header, find largest embedded size ≤64x64 at a standard dimension (16/32/48/64), store THAT size in width/height
       - SVG: store width=NULL, height=NULL (vector, no pixel size)
     - Compute SHA-256 of content
     - Write to local disk at `{icons_dir}/ab/cd/ef/{sha256}` (skip if file already exists — dedup)
     - Update icons row: s3_key (the SHA-256 hash), content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
   - On failure: scan_state = 'failed', error = reason

**Concurrency:** Channel-based worker pool (default 200 workers, configurable). Producer goroutine feeds a buffered channel (buffer = batch size), N workers consume. No starvation between batch claims.

**Fast failure strategy:**

- DNS failure → fail immediately (Unbound will cache NXDOMAIN)
- Connection refused → fail immediately
- Timeout → fail after deadline (no retry)
- Too large → abort read at 512KB boundary
- Not an image → fail (record content-type in error)

**Permissive on format:** Download everything — ICO, PNG, GIF, SVG, WebP, JPEG, BMP, whatever the server returns. Store the raw bytes on disk.
Format filtering and conversion happen later in bundle generation.

**Scaling to fleet (if needed):**

- Multiple EC2 instances run the same binary
- Each claims work via Postgres row-level locking (`FOR UPDATE SKIP LOCKED`)
- No coordinator needed — linear scaling with instance count

**Stats emitted:** Icons attempted, completed, failed (breakdown by error type: DNS, timeout, connection refused, HTTP 4xx, HTTP 5xx, invalid image, too large), icons/sec rate, bytes downloaded, dedup hits.

### Stage 4: Best Icon Selection

**Tool:** SQL script

**Process:** For each host, select the best icon from all its completed downloads.

**Selection priority (decision flow):**

Target: 32x32 source icon. The frontend displays favicons at 16x16 CSS pixels, which is 32x32 physical pixels on 2x Retina screens. So 32x32 is the ideal source resolution — crisp on Retina without wasting bundle space.

1. **Icons ≥32px** (preferred): smallest first, so closest to 32 wins. A 32x32 beats a 48x48 beats a 180x180.
2. **Icons <32px** (fallback): largest first. A 16x16 beats an 8x8.
3. **Unknown dimensions** (NULL width/height): last resort.

Within the same size tier:

- Prefer PNG > ICO > GIF/JPEG/BMP > WebP
- Tiebreaker: smaller file size

SVGs excluded (can't rasterize without external deps). Icons ≤2x2 excluded (tracking pixels).

Does not distinguish between `favicon_ico` and `link_rel` sources — purely based on what was actually downloaded and its dimensions/format.

Uses `DISTINCT ON (host_id)` for efficient single-pass selection. See `pipeline/04_best_icon/select.sql`.

**Stats emitted:** Hosts with icons selected, hosts without any icon.

### Stage 5: Bundle Generation

**Tool:** Custom Go program (multi-threaded for image processing)

**Input:** All hosts where `html_title IS NOT NULL` (include hosts without icons)

**Process:**

1. Stream hosts from RDS in pages (keyset pagination on `random_order` column for shuffled output)
2. For each page, concurrently convert icons (configurable concurrency, default 200):
   - Read icon from local disk at `{icons_dir}/ab/cd/ef/{hash}`
   - Decode the image via Go's `image.Decode` (handles PNG, GIF, JPEG, WebP, ICO via registered decoders)
   - SVGs are excluded (no rasterizer) — these hosts appear without icons
   - Icons >128px downscaled to 32x32 (nearest-neighbor). Icons ≤128px kept as-is.
   - Re-encode as PNG, base64-encode
3. Converted entries accumulate in a buffer. Every 120 entries (configurable), serialize as JSON and upload to S3
4. Hosts without icons: included with `"icon": ""`
5. Final partial bundle written at end

**Output:**

- `tabs/0000.json` through `tabs/{M-1}.json` in S3 `everytab-site`
- Total bundle count M (bake into frontend via deploy script)

**Stats emitted:** Total bundles created, total hosts included (with icon / without icon), average bundle size (bytes), total S3 storage used, icon conversion failures.

### Stage 6: Frontend Deploy

**Tool:** `pipeline/06_frontend/deploy.sh`

**Process:**

1. `sed` injects `const TOTAL_BUNDLES = {M};` into a temp copy of `index.html`
2. Uploads `index.html`, `site.js`, `bot.html`, `about.html` to S3 `everytab-site`
3. Invalidates CloudFront cache for all four files (auto-detects distribution ID)

### Stage 7: Backup & Teardown

**Process (manual, with confirmation at each step):**

1. Dump RDS database: `pg_dump -Fc` → transfer to homelab via rsync
2. Sync icons from local disk: `rsync -avP ~/icons/ homelab:/backups/everytab/icons/`
3. **Verify backups:** confirm pg_dump restores cleanly on homelab, spot-check icon files
4. Tear down scanning infra: `terraform apply -var="scanning=false"` (deletes RDS, EC2, icons S3 bucket)

## DNS Architecture

**Unbound** runs on the EC2 instance as the system DNS resolver.
**Configuration:**
- Recursive resolver mode (no forwarding to any upstream — resolves from root servers)
- Listening on 127.0.0.1:53
- Set as system resolver in `/etc/resolv.conf`
- Aggressive caching enabled
- High min-TTL (3600s) — maximizes cache hits for TLD/popular nameservers
- High cache size (allocate 1-2GB RAM to Unbound)
- Prefetch enabled (refresh popular entries before expiry)

**Why recursive instead of forwarding:** Forwarding to Google/Cloudflare would get us rate-limited at 30M+ lookups. Recursive resolution distributes load across thousands of authoritative nameservers. With caching, the actual external query volume is much lower than 30M (most domains share TLD nameservers, many share CDN nameservers).

**Transparent to Go:** The Go HTTP client uses the OS resolver, which uses Unbound. No custom transport, no SNI issues, no pre-resolved IPs needed. Standard HTTPS connections with normal hostname verification.

## Frontend Architecture

### File Structure

- `index.html` — minimal HTML shell, inline CSS
- `site.js` — tab rendering logic, bundle fetching, interaction (separate file for cleanliness, cached after first load)

### Requests Per Visit

1. `GET /index.html` — HTML + CSS (<10KB)
2. `GET /site.js` — JavaScript (cached indefinitely via content hash in filename or cache headers)
3. `GET /tabs/{random}.json` — first bundle (~150-300KB, Brotli-compressed to ~100-200KB)

Subsequent scrolls: one additional `/tabs/{n}.json` per "page" of tabs.
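The bundle fetched in step 3 is the JSON document written by Stage 5. As a rough sketch of its shape on the producing side, the Go types below mirror the field names from the bundle format. The names `Entry`, `Bundle`, and `encodeBundle` are hypothetical, and omitting `icon_w`/`icon_h` for icon-less hosts is an assumption drawn from the example entry rather than a documented rule.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entry mirrors one element of "entries" in tabs/{n}.json.
// Type and field names are illustrative; only the JSON keys come from the format.
type Entry struct {
	URL      string `json:"url"`
	Title    string `json:"title"`
	Icon     string `json:"icon"`             // base64-encoded PNG, "" when the host has no icon
	IconW    int    `json:"icon_w,omitempty"` // assumption: dropped when there is no icon
	IconH    int    `json:"icon_h,omitempty"` // assumption: dropped when there is no icon
	IframeOK bool   `json:"iframe_ok"`
}

// Bundle is the top-level object of one bundle file.
type Bundle struct {
	Entries []Entry `json:"entries"`
}

// encodeBundle serializes entries into the bundle JSON layout.
func encodeBundle(entries []Entry) (string, error) {
	b, err := json.Marshal(Bundle{Entries: entries})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeBundle([]Entry{
		{URL: "https://example.com", Title: "Example Domain", Icon: "iVBORw0KGgo...", IconW: 32, IconH: 32, IframeOK: true},
		{URL: "http://no-favicon-site.org", Title: "A Site Without Favicon"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Keeping `icon` always present (even when empty) lets the frontend test a single field to decide between icon and title-only rendering.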
### Tab Rendering

- Rows of tabs fill the viewport, styled to match the visitor's browser (Chrome, Firefox, Safari — detected via `navigator.userAgent`)
- Each row has a bidirectional marquee animation at varying speeds (90-150s per cycle), with random stagger to avoid synchronization
- Tabs duplicated in DOM for seamless marquee loop (`translateX(-50%)`)
- Each tab shows: favicon (rendered via `<img src="data:image/png;base64,...">`) + truncated title
- No-icon tabs: just title text, no icon
- Light mode default, auto-switches to dark mode via `prefers-color-scheme`
- Hover shows full title as native tooltip

### Interaction

- **Click tab (iframe_ok=true):** Opens an inline iframe viewer between tab rows (75vh height, pushes content down)
- **Click tab (iframe_ok=false):** Opens site in a new tab (with `↗` external-link indicator on the tab)
- **Close viewer:** X button or Escape key. Only one viewer open at a time.
- **Scroll down:** When approaching the bottom, fetch next random bundle and render more rows

### Randomization

- Seed: `Date.now()` (milliseconds UTC) — every visitor at a different moment sees different tabs
- PRNG: seeded random number generator (e.g., mulberry32 or xoshiro) for deterministic sequence from seed
- Generate random bundle indices in range `[0, TOTAL_BUNDLES)`
- Track fetched bundle IDs in a `Set` to avoid loading duplicates on continued scroll

### Future Enhancements

- Mobile-optimized layout
- "Search for a site" feature
- Stats page (how many sites, coverage, etc.)
- Performance: IntersectionObserver to pause off-screen marquee rows

## Statistics & Metadata

Each pipeline stage emits a JSON stats file:

```
stats/
  01_cc_index.json
  02_warc_parse.json
  03_icon_download.json
  04_best_icon.json
  05_bundle_gen.json
```

After bundle generation, these are merged into a single `stats.json` uploaded to `everytab-site`:

```json
{
  "crawl_id": "CC-MAIN-2026-05",
  "generated_at": "2026-05-17T12:00:00Z",
  "pipeline": {
    "cc_index": {
      "started_at": "2026-05-17T08:00:00Z",
      "finished_at": "2026-05-17T08:42:00Z",
      "duration_seconds": 2520,
      "total_domains": 31245678,
      "https": 28901234,
      "http_only": 2344444,
      "duplicates_removed": 1456789
    },
    "warc_parse": {
      "started_at": "2026-05-17T08:45:00Z",
      "finished_at": "2026-05-17T12:15:00Z",
      "duration_seconds": 12600,
      "processed": 31245678,
      "titles_extracted": 29876543,
      "icons_found": 45678901,
      "iframe_restricted": 12345678,
      "parse_failures": 234567
    },
    "icon_download": {
      "started_at": "2026-05-17T12:20:00Z",
      "finished_at": "2026-05-18T18:30:00Z",
      "duration_seconds": 108600,
      "attempted": 45678901,
      "completed": 38901234,
      "failed_dns": 2345678,
      "failed_timeout": 1234567,
      "failed_http_error": 1567890,
      "failed_invalid_image": 890123,
      "failed_too_large": 12345,
      "unique_icons_stored": 34567890,
      "dedup_hits": 4333344
    },
    "best_icon": {
      "started_at": "2026-05-18T18:35:00Z",
      "finished_at": "2026-05-18T18:40:00Z",
      "duration_seconds": 300,
      "hosts_with_icon": 27654321,
      "hosts_without_icon": 3591357
    },
    "bundles": {
      "started_at": "2026-05-18T18:45:00Z",
      "finished_at": "2026-05-18T20:10:00Z",
      "duration_seconds": 5100,
      "total_bundles": 52341,
      "total_hosts_included": 29876543,
      "hosts_with_icon": 27654321,
      "hosts_without_icon": 2222222,
      "excluded_no_title": 1369135,
      "avg_bundle_size_bytes": 245000
    }
  }
}
```

This is served publicly at `/stats.json` on the live site — interesting metadata for visitors and useful for monitoring pipeline health across crawls.
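The merge step can be sketched in Go as follows. This is a minimal illustration, not the pipeline's actual code: `mergeStats` and `mergedStats` are hypothetical names, and because the mapping from file names to keys is not spelled out (for example `05_bundle_gen.json` appears under `bundles`), the caller is assumed to supply each stage's key explicitly.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergedStats is the top-level layout of stats.json.
type mergedStats struct {
	CrawlID     string                     `json:"crawl_id"`
	GeneratedAt string                     `json:"generated_at"`
	Pipeline    map[string]json.RawMessage `json:"pipeline"`
}

// mergeStats combines per-stage JSON documents, keyed by their name in the
// "pipeline" object (for example "cc_index"), into one stats.json payload.
// Each stage document is embedded unchanged after a validity check.
func mergeStats(crawlID, generatedAt string, stages map[string][]byte) (string, error) {
	merged := mergedStats{
		CrawlID:     crawlID,
		GeneratedAt: generatedAt,
		Pipeline:    make(map[string]json.RawMessage, len(stages)),
	}
	for name, raw := range stages {
		if !json.Valid(raw) {
			return "", fmt.Errorf("stage %q: invalid JSON", name)
		}
		merged.Pipeline[name] = json.RawMessage(raw)
	}
	out, err := json.Marshal(merged)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// In a real run these bytes would be read from the files under stats/.
	out, err := mergeStats("CC-MAIN-2026-05", "2026-05-17T12:00:00Z", map[string][]byte{
		"cc_index":  []byte(`{"total_domains":31245678,"https":28901234}`),
		"best_icon": []byte(`{"hosts_with_icon":27654321}`),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Using `json.RawMessage` keeps each stage's fields opaque, so a stage can add counters without the merge step changing. One side effect is that `encoding/json` writes map keys in sorted order, so stages appear alphabetically rather than in pipeline order.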
## Cost Estimate

### Scanning Phase (One-Time per Crawl)

| Item | Estimate |
|------|----------|
| EC2 c5.xlarge (~3-4 days) | $12-16 |
| EBS 1TB gp3 (~4 days) | $10 |
| RDS db.t3.medium (~4 days) | $4-6 |
| Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) |
| Data transfer (icon downloads from internet, inbound) | $0 (inbound free) |
| Data transfer (backup to homelab, outbound) | $5-45 (depends on icon archive size) |
| **Total** | **~$31-77** |

### Hosting Phase (Monthly Steady-State)

| Item | Estimate |
|------|----------|
| S3 everytab-site storage (~10-15GB of bundles) | $0.35 |
| CloudFront (free tier: 1TB/month transfer, 10M requests/month) | $0 |
| S3 origin requests via CloudFront (heavily cached) | $1-3 |
| **Total** | **~$2-4/month** |

Note: Bundle storage estimate revised down. With ~50K bundles at ~250KB each = ~12.5GB, well under previous estimate since we're targeting viewport-fill (100-150 tabs) not 1MB bundles.

If the site gets significant traffic beyond CloudFront free tier, costs scale with usage — but that's a success problem.

## Scaling Strategy

### Development Phase (100K domains)

- Cap CC-Index query to 100K rows
- Full pipeline runs in minutes
- Validates end-to-end correctness
- Frontend development and tab-density tuning

### Full Scan (30M domains)

- Single EC2 instance, high concurrency
- CC-Index query: <1hr (httpfs) or ~2hrs (download + local query)
- WARC parsing: 2-6hrs
- Icon download: 12-48hrs (the long pole)
- Bundle generation: 1-2hrs
- Total: ~1-2 days

### Fleet Scaling (if single instance is too slow)

- Spin up N identical EC2 instances running the icon downloader
- All connect to the same RDS instance
- Work claiming via `FOR UPDATE SKIP LOCKED` — no double work, no coordinator
- Linear throughput scaling: 4 instances ≈ 4x download speed
- Only the icon download stage benefits from fleet (other stages are fast enough solo)

## Key Design Decisions

1. **Static-only hosting** — No servers for the live site.
Everything pre-built. Minimal attack surface, minimal cost.
2. **Inline icons in bundles** — One fetch gives you 100+ tabs to render. No per-icon requests.
3. **Base64 + Brotli** — Base64 lets each icon be dropped straight into an `<img src="data:image/png;base64,...">` URI, so the browser decodes it natively with no extra JavaScript step. Brotli compression at the CDN layer reduces transfer size by ~25-30% for free.
4. **Unbound as system resolver** — Transparent to application code. Standard Go HTTP. No custom networking.
5. **SHA-256 content-addressed icon storage** — Natural dedup on local disk. Same favicon stored once even if referenced by multiple hosts.
6. **Permissive download, selective bundling** — Download ALL favicon formats and sizes during scanning. Convert to optimized PNG only during bundle generation. Decouples "capture as much as possible" from "serve the best version."
7. **Partial index for work claiming** — Indexes only unscanned rows. Shrinks as work progresses. Minimal write amplification.
8. **Local disk for icons, S3 for site** — Icons stored on EBS during scanning (avoids ~$175 in S3 PUT costs at 30M scale). Only the static site lives in S3 behind CloudFront.
9. **Per-millisecond random seed** — Every visitor sees a unique arrangement. No shared state, no server needed for randomization.
10. **Viewport-sized bundles** — ~100-150 tabs per bundle, tuned to fill a screen. Faster loads, smaller memory footprint than 1MB bundles.
11. **Include no-icon hosts** — Any host with a title is included. Firefox-style rendering (title only) for hosts without favicons.
12. **Denormalized best_icon_s3_key in hosts** — Stores the SHA-256 hash of the chosen icon. Avoids joins during bundle generation. Written once during icon selection, read once during bundling.
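
To make decision 5 concrete, here is a minimal Go sketch of content-addressed storage using the `ab/cd/ef/{full_hash}` layout described earlier. The function names (`hashHex`, `shardKey`, `storeIfAbsent`) are hypothetical, and the write-to-temp-then-rename step is an assumption added so that concurrent workers never observe a half-written file; the document itself only specifies "skip if file already exists".

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// hashHex returns the lowercase hex SHA-256 of the icon bytes.
func hashHex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// shardKey turns a 64-character hash into the ab/cd/ef/{full_hash} relative key.
func shardKey(hash string) string {
	return hash[0:2] + "/" + hash[2:4] + "/" + hash[4:6] + "/" + hash
}

// storeIfAbsent writes data under dir at its content address and reports
// whether a new file was written (false means a dedup hit).
func storeIfAbsent(dir string, data []byte) (hash string, wrote bool, err error) {
	hash = hashHex(data)
	dst := filepath.Join(dir, filepath.FromSlash(shardKey(hash)))
	if _, statErr := os.Stat(dst); statErr == nil {
		return hash, false, nil // identical bytes already stored
	} else if !errors.Is(statErr, fs.ErrNotExist) {
		return "", false, statErr
	}
	if err = os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
		return "", false, err
	}
	// Assumption: write to a temp file and rename so readers never see partial content.
	tmp, err := os.CreateTemp(filepath.Dir(dst), "tmp-*")
	if err != nil {
		return "", false, err
	}
	if _, err = tmp.Write(data); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return "", false, err
	}
	if err = tmp.Close(); err != nil {
		os.Remove(tmp.Name())
		return "", false, err
	}
	if err = os.Rename(tmp.Name(), dst); err != nil {
		os.Remove(tmp.Name())
		return "", false, err
	}
	return hash, true, nil
}

func main() {
	dir, err := os.MkdirTemp("", "icons")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	// Storing the same bytes twice: the second call should be a dedup hit.
	for i := 0; i < 2; i++ {
		hash, wrote, err := storeIfAbsent(dir, []byte("abc"))
		if err != nil {
			panic(err)
		}
		fmt.Println(shardKey(hash), wrote)
	}
}
```

The returned hash is what would be recorded in `icons.s3_key` either way, so a dedup hit still links the row to the stored file.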