diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index 4eeb3ac..86f439e 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -7,24 +7,58 @@ EveryTab is a static website that displays a page full of browser tabs represent
 1. **Scanning Phase** — A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
 2. **Hosting Phase** — A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.
 
-The scanning phase runs monthly (triggered by new Common Crawl releases), produces a static site, and then its infrastructure is torn down. The hosting phase runs indefinitely at minimal cost.
+The scanning phase runs monthly (triggered by new Common Crawl releases), produces a static site, and then its infrastructure is torn down after backing up data to the homelab. The hosting phase runs indefinitely at minimal cost.
 
+## Workflow Diagram
+
+```mermaid
+flowchart TD
+    subgraph "Scanning Phase (EC2 instance)"
+        A[Stage 1: Query CC-Index via DuckDB] --> B[Stage 2: Parse WARCs - Go]
+        B --> C[Stage 3: Download Icons - Go]
+        C --> D[Stage 4: Select Best Icons]
+        D --> E[Stage 5: Generate Bundles - Go]
+        E --> F[Stage 6: Build Frontend]
+    end
+
+    subgraph "External Data"
+        CC[Common Crawl S3\nParquet Index + WARCs]
+    end
+
+    subgraph "AWS Services"
+        RDS[(RDS Postgres\nhosts + icons tables)]
+        S3I[S3: everytab-icons\nRaw downloaded favicons]
+        S3S[S3: everytab-site\ntabs/*.json + index.html]
+        CF[CloudFront CDN]
+    end
+
+    subgraph "Post-Scan"
+        BAK[Backup to Homelab\nRDS dump + icons sync]
+        TEAR[Teardown\nDelete RDS, icons bucket, EC2]
+    end
+
+    CC --> A
+    CC --> B
+    A --> RDS
+    B --> RDS
+    B --> S3I
+    C --> S3I
+    C --> RDS
+    D --> RDS
+    E --> S3S
+    F --> S3S
+    S3S --> CF
+
+    F --> BAK
+    BAK --> TEAR
+
+    subgraph "DNS"
+        UB[Unbound\nLocal recursive resolver\non EC2]
+    end
+    UB -.-> C
 ```
-Common Crawl (S3)
-       |
-       v
-[EC2 + DuckDB] ---> [RDS Postgres] ---> [EC2 + Go programs] ---> S3 (icons/)
-       |                  |                      |                     |
-       |            (hosts, icons                |                     |
-       |               tables)                   v                     v
-       |                  |              [Bundle Generator] ---> S3 (tabs/*.json)
-       |                  |                      |
-       |                  v                      v
-       |          [Backup to homelab]     S3 (index.html)
-       |                                         |
-       v                                         v
-[Tear down EC2, RDS]                     [CloudFront CDN]
-```
+
+**Key point:** DuckDB, Go programs, and Unbound all run on the same EC2 instance. The pipeline is sequential — one stage completes before the next begins.
 
 ## AWS Infrastructure
 
@@ -32,21 +66,24 @@ All resources in **us-east-1**.
 
 | Resource | Purpose | Lifecycle |
 |----------|---------|-----------|
-| EC2 (xlarge, compute-optimized) | Run pipeline stages | Scanning only |
-| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup then delete) |
-| S3 `everytab-icons` | Raw downloaded favicons | Scanning only (backup then delete) |
-| S3 `everytab-site` | Static site: index.html + tabs/*.json | Permanent |
-| CloudFront | CDN for static site | Permanent |
-| Unbound (on EC2) | Local recursive DNS resolver | Scanning only |
+| EC2 (c5.xlarge) | Run all pipeline stages | Scanning only |
+| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup to homelab, then delete) |
+| S3 `everytab-icons` | Raw downloaded favicons | Scanning only (backup to homelab, then delete) |
+| S3 `everytab-site` | Static site: index.html, site.js, tabs/*.json | Permanent |
+| CloudFront | CDN for static site (Brotli compression enabled) | Permanent |
+| Unbound (on EC2) | Local recursive DNS resolver | Scanning only (runs on EC2) |
+
+### Why Two S3 Buckets
+
+- `everytab-site` is configured as a CloudFront origin with public read access (via OAC). The entire bucket IS the website.
+- `everytab-icons` is completely private — only the EC2 instance reads/writes to it. No public access configuration needed.
+- Backup is clean: `aws s3 sync s3://everytab-icons/ /homelab/path/` grabs the whole bucket.
+- Deletion is clean: `aws s3 rb s3://everytab-icons --force` — zero risk of nuking the live site.
+- One bucket with prefix-based policies works but is fiddlier (CloudFront must serve `tabs/` and `index.html` but NOT `icons/`). Two buckets eliminates that surface area for misconfiguration.
 
 ### Steady-State (Hosting Only)
-- S3 `everytab-site` — stores index.html + ~50K JSON bundle files (~60GB total)
-- CloudFront distribution — serves the site with caching
-
-### Scanning Phase (Temporary)
-- EC2 instance — runs all processing (no persistent local storage needed beyond OS)
-- RDS — structured data store during pipeline execution
-- S3 `everytab-icons` — temporary storage for downloaded favicons
+- S3 `everytab-site` — index.html + site.js + ~50K JSON bundles
+- CloudFront distribution — Brotli-compressed delivery, caching
 
 ## Data Model
 
@@ -55,19 +92,17 @@ All resources in **us-east-1**.
 | Column | Type | Description |
 |--------|------|-------------|
 | id | SERIAL PRIMARY KEY | Internal ID |
-| hostname | TEXT NOT NULL | e.g., `example.com` |
+| hostname | TEXT NOT NULL UNIQUE | e.g., `example.com` |
 | protocol | TEXT NOT NULL | `https` or `http` (prefer https) |
 | crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., `CC-MAIN-2026-05`) |
 | warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
 | warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
 | warc_record_length | INT NOT NULL | Length of WARC record |
 | html_title | TEXT | Extracted from `<title>` tag |
-| iframe_allowed | BOOLEAN | True if site allows framing (no X-Frame-Options/CSP restriction) |
-| best_icon_id | INT REFERENCES icons(id) | FK to the chosen icon for bundling |
+| iframe_allowed | BOOLEAN | True if site allows framing |
+| best_icon_s3_key | TEXT | S3 key of the chosen icon (denormalized for fast bundle gen) |
 | parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |
 
-**Constraints:** UNIQUE(hostname) — one row per domain, prefer https over http.
-
 ### `icons` table
 
 | Column | Type | Description |
@@ -76,18 +111,23 @@ All resources in **us-east-1**.
 | host_id | INT REFERENCES hosts(id) | FK to parent host |
 | url | TEXT NOT NULL | Full URL to the icon |
 | source | TEXT NOT NULL | `favicon_ico` or `link_rel` |
-| content_type | TEXT | MIME type after download (image/png, image/x-icon, etc.) |
+| rel_type | TEXT | MIME type from HTML attribute (if specified) |
+| rel_sizes | TEXT | Sizes attribute from HTML (if specified) |
+| content_type | TEXT | Actual MIME type after download |
 | width | INT | Decoded pixel width |
 | height | INT | Decoded pixel height |
-| s3_key | TEXT | Key in everytab-icons bucket |
+| file_size | INT | Size in bytes |
+| s3_key | TEXT | Key in everytab-icons bucket (SHA-256 of content) |
 | scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
 | error | TEXT | Error message if failed |
 
 **Indexes:**
-- `idx_icons_scan_state` on (scan_state) — for batch claiming work
-- `idx_icons_host_id` on (host_id) — for best-icon selection
+- `CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned'` — partial index for work claiming. Only indexes unscanned rows; shrinks as work completes. Minimal write overhead since index only updates on transition OUT of 'unscanned'.
+- `idx_icons_host_id` on (host_id) — for best-icon selection query
 
-### Bundle JSON format (`tabs/0001.json`)
+**S3 Key Strategy:** SHA-256 hash of the downloaded icon content. This gives free dedup at the storage layer — if two sites serve the exact same favicon bytes, we store it once. The hash is computed client-side (by the Go downloader) and used as the key. Before uploading, check if the key exists; if so, skip the upload but still record the s3_key in the icons table.
+
+### Bundle JSON format (`tabs/{n}.json`)
 
 ```json
 {
@@ -99,22 +139,30 @@ All resources in **us-east-1**.
       "icon_w": 32,
       "icon_h": 32,
       "iframe_ok": true
+    },
+    {
+      "host": "no-favicon-site.org",
+      "title": "A Site Without Favicon",
+      "icon": "",
+      "iframe_ok": false
     }
   ]
 }
 ```
 
-Icons are stored inline as base64-encoded PNG. Each bundle targets ~1MB, yielding approximately 500-700 entries per bundle depending on icon sizes.
+Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with `"icon": ""`) as long as they have a title. CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data.
+
+Bundle size targets ~100-150 entries (enough to fill a viewport with buffer for scrolling). Estimated ~150-300KB per bundle uncompressed, smaller after Brotli.
 
 ## Pipeline Stages
 
-The pipeline is a series of manually-run scripts executed in order. Each stage is idempotent and resumable.
+The pipeline is a series of manually-run scripts executed in order on the single EC2 instance. Each stage is idempotent and resumable.
 
 ### Stage 1: CC-Index Query
 
-**Tool:** DuckDB with httpfs extension (or local parquet if httpfs takes >1hr)
+**Tool:** DuckDB with httpfs extension (query CC parquet directly from S3; if >1hr, fall back to downloading parquet locally first)
 
-**Input:** Common Crawl columnar index (parquet files on CC's S3)
+**Input:** Common Crawl columnar index (parquet files on `s3://commoncrawl/cc-index/...`)
 
 **Query logic:**
 ```sql
@@ -130,6 +178,8 @@ WHERE url_path = '/'
 
 **Output:** Populates `hosts` table in RDS (~30M rows for a full crawl).
 
+**Cost:** $0 — Common Crawl is part of the AWS Open Data Registry. S3 GET requests and data transfer within us-east-1 are free.
+
 **Stats emitted:** Total domains found, https vs http breakdown, duplicates removed.
 
 ### Stage 2: WARC Parsing
@@ -139,171 +189,278 @@ WHERE url_path = '/'
 **Input:** `hosts` table rows where `parsed = FALSE`
 
 **Process:**
-1. Claim a batch of rows (set parsed = TRUE optimistically, or use a cursor)
+1. Read batches of unparsed rows (cursor-based pagination by ID)
 2. For each row, make a byte-range GET request to Common Crawl's S3:
    - `Range: bytes={offset}-{offset+length-1}`
-   - Target: `s3://commoncrawl/{warc_filename}`
+   - Target: `https://data.commoncrawl.org/{warc_filename}`
 3. Parse the WARC record to extract the HTTP response
-4. Parse HTML (defensively — handle malformed HTML, use a lenient parser):
+4. From HTTP response headers: check for `X-Frame-Options` and `Content-Security-Policy` frame-ancestors
+5. Parse HTML defensively (lenient parser, handle malformed HTML):
    - Extract `<title>` tag content
-   - Extract `<link rel="icon">` href values (filter to png/gif/ico, sizes 16-64px)
-   - Check HTTP response headers for `X-Frame-Options` and CSP `frame-ancestors`
-5. Insert a `/favicon.ico` entry into `icons` for every host (always attempt this)
-6. Insert any qualifying `link rel="icon"` entries into `icons`
-7. Update `hosts` row with `html_title`, `iframe_allowed`, `parsed = TRUE`
+   - Extract ALL `<link rel="icon">` / `<link rel="shortcut icon">` entries with their href, type, and sizes attributes
+6. Insert a `/favicon.ico` entry into `icons` for every host (protocol://hostname/favicon.ico)
+7. Insert all discovered `link rel="icon"` entries into `icons` (any format: ICO, PNG, GIF, SVG, WebP, JPEG)
+8. Update `hosts` row: html_title, iframe_allowed, parsed = TRUE
 
-**Concurrency:** High — thousands of goroutines. S3 byte-range requests are the bottleneck; S3 handles 5,500+ GET/s per prefix and WARC files are spread across many prefixes.
+**Concurrency:** High — thousands of goroutines with a semaphore/pool. CC's S3 handles massive throughput.
 
-**Error handling:** If HTML is unparseable, mark as parsed with NULL title. If WARC fetch fails, retry once then skip. Log all errors with hostname for investigation.
+**Error handling:** Malformed HTML → still extract what we can (partial title, partial icons). WARC fetch failure → log and skip (mark parsed = TRUE with NULL title to avoid retry loops). All errors logged with hostname for investigation.
 
-**Stats emitted:** Rows processed, titles extracted, icons found (by type), iframe restrictions found, parse failures.
+**Icon URL handling:** Relative URLs resolved against `{protocol}://{hostname}/`. Absolute URLs kept as-is. Data URIs ignored.
 
-### Stage 3: DNS Resolution Setup
+**No scan_state needed:** CC's S3 is highly reliable. The `parsed` boolean is sufficient. If the process crashes mid-batch, re-run picks up where it left off (unparsed rows).
 
-**Tool:** Unbound, installed and configured on EC2
+**Cost:** $0 (same Open Data program).
 
-**Configuration:**
-- Recursive resolver (no forwarding to upstream)
-- Listening on 127.0.0.1:53
-- Aggressive caching enabled
-- High min-TTL (e.g., 3600s) to maximize cache hits across similar domains
-- Configured as system resolver in `/etc/resolv.conf`
+**Stats emitted:** Rows processed, titles extracted, icons found (by source: favicon_ico vs link_rel), icon format distribution, iframe restrictions found, parse failures, rows with no title.
 
-This runs as a background service. No separate "DNS resolution stage" — the Go icon downloader's HTTP requests transparently use Unbound via the OS resolver. Unbound handles recursive resolution and caching.
-
-**Why:** Downloading 30M+ icons without a local recursive resolver would overwhelm upstream DNS providers and likely get us rate-limited. Unbound resolves from root servers directly, caches aggressively, and handles the load locally.
-
-### Stage 4: Icon Download
+### Stage 3: Icon Download
 
 **Tool:** Custom Go program, highly concurrent
 
+**Prerequisite:** Unbound running as system resolver on the EC2 instance.
+
 **Input:** `icons` table rows where `scan_state = 'unscanned'`
 
 **Process:**
-1. Claim a batch of rows (UPDATE scan_state = 'in_progress' WHERE scan_state = 'unscanned' LIMIT N RETURNING *)
+1. Claim batch: `UPDATE icons SET scan_state = 'in_progress' WHERE scan_state = 'unscanned' AND id IN (SELECT id FROM icons WHERE scan_state = 'unscanned' LIMIT N FOR UPDATE SKIP LOCKED) RETURNING *`
 2. For each icon URL:
-   - Make HTTP(S) GET request (normal Go HTTP client, DNS goes through Unbound)
-   - Enforce timeout (5s connect, 10s total)
-   - Enforce max download size (512KB — generous for icons)
-   - On success: validate it's an image (check magic bytes), decode to get dimensions
-   - Upload raw bytes to S3 `everytab-icons/{hash}` (content-addressed)
-   - Update `icons` row: s3_key, content_type, width, height, scan_state = 'completed'
+   - Make HTTP(S) GET request (standard Go HTTP client — DNS transparently goes through Unbound)
+   - Enforce timeouts: 5s connect, 10s total
+   - Enforce max download size: 512KB (generous for icons, but prevents abuse)
+   - On success:
+     - Validate magic bytes (is this actually an image?)
+     - Decode to get dimensions (width, height) — just read headers, don't fully decode
+     - Compute SHA-256 of content
+     - Upload to S3 `everytab-icons/{sha256}` (skip if key already exists — dedup)
+     - Update icons row: s3_key, content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
    - On failure: scan_state = 'failed', error = reason
 
-**Concurrency:** Maximize throughput — goroutine pool with configurable size (start at 1000, tune based on memory/bandwidth). Use semaphore pattern for backpressure.
+**Concurrency:** Goroutine pool with configurable size (start 1000, tune based on system resources). Semaphore pattern for backpressure. Monitor memory usage.
 
-**Fast failure:** DNS errors, connection refused, timeouts all fail immediately (no retry for icons — if it's down, it's down). This keeps the long tail short.
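
The size cap and magic-byte validation steps in the Stage 3 process can be sketched in Go roughly as follows. This is a minimal illustration, not the downloader's actual code: `readCapped`, `sniffImage`, and the error values are hypothetical names, and using `http.DetectContentType` for sniffing is an assumption. That function recognizes PNG, GIF, JPEG, BMP, WebP, and ICO signatures but not SVG, so SVG would need a separate check.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// maxIconBytes is the 512KB download cap described in Stage 3.
const maxIconBytes = 512 * 1024

// Hypothetical error values for the "too large" and "not an image" failures.
var (
	errTooLarge = errors.New("too large")
	errNotImage = errors.New("not an image")
)

// readCapped reads at most limit bytes from r. It reads one extra byte so
// that a body longer than the limit can be detected and rejected.
func readCapped(r io.Reader, limit int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errTooLarge
	}
	return data, nil
}

// sniffImage inspects the leading bytes and returns the detected MIME type.
// It ignores any Content-Type header and looks only at the data itself.
func sniffImage(data []byte) (string, error) {
	ct := http.DetectContentType(data)
	if !strings.HasPrefix(ct, "image/") {
		return ct, errNotImage
	}
	return ct, nil
}

func main() {
	// A stand-in for an HTTP response body that starts with the PNG signature.
	body := strings.NewReader("\x89PNG\r\n\x1a\n...rest of file...")
	data, err := readCapped(body, maxIconBytes)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	ct, err := sniffImage(data)
	fmt.Println(ct, err)
}
```

In a real downloader the `io.Reader` would be the response body, so an oversized response is abandoned after 512KB plus one byte rather than read in full.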
+**Fast failure strategy:**
+- DNS failure → fail immediately (Unbound will cache NXDOMAIN)
+- Connection refused → fail immediately
+- Timeout → fail after deadline (no retry)
+- Too large → abort read at 512KB boundary
+- Not an image → fail (record content-type in error)
 
-**Scaling to fleet:** If a single instance is insufficient:
+**Permissive on format:** Download everything — ICO, PNG, GIF, SVG, WebP, JPEG, BMP, whatever the server returns. Store the raw bytes in S3. Format filtering and conversion happens later in bundle generation.
+
+**Scaling to fleet (if needed):**
 - Multiple EC2 instances run the same binary
-- Each claims work via the `scan_state` UPDATE (Postgres row-level locking prevents double-work)
-- No coordination needed beyond the shared database
+- Each claims work via Postgres row-level locking (`FOR UPDATE SKIP LOCKED`)
+- No coordinator needed — linear scaling with instance count
 
-**Stats emitted:** Icons attempted, completed, failed (by error type: DNS, timeout, HTTP error, invalid image, too large), download rate (icons/sec), bytes downloaded.
+**Stats emitted:** Icons attempted, completed, failed (breakdown by error type: DNS, timeout, connection refused, HTTP 4xx, HTTP 5xx, invalid image, too large), icons/sec rate, bytes downloaded, unique S3 keys (dedup hits).
 
-### Stage 5: Best Icon Selection
+### Stage 4: Best Icon Selection
 
-**Tool:** SQL query or small script
+**Tool:** SQL script
 
-**Process:**
-For each host, select the best icon from its completed icons:
-1. Filter to standard sizes: 16x16, 32x32, 48x48, 64x64
-2. Among those, pick the largest dimensions (prefer 64 > 48 > 32 > 16)
-3. If no standard sizes found, pick the largest icon with dimensions <= 64px on both axes
-4. If no icons at all, host gets a NULL best_icon_id (will use default in frontend)
+**Process:** For each host, select the best icon from its completed downloads:
 
 ```sql
-UPDATE hosts h SET best_icon_id = (
-    SELECT id FROM icons i
-    WHERE i.host_id = h.id AND i.scan_state = 'completed'
+UPDATE hosts h SET best_icon_s3_key = (
+    SELECT i.s3_key FROM icons i
+    WHERE i.host_id = h.id
+      AND i.scan_state = 'completed'
     ORDER BY
-        (width IN (16,32,48,64) AND height IN (16,32,48,64)) DESC,
-        width DESC
+        -- Prefer standard square sizes
+        CASE
+            WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
+            WHEN i.width = i.height AND i.width <= 64 THEN 1
+            WHEN i.width <= 64 AND i.height <= 64 THEN 2
+            ELSE 3
+        END,
+        -- Among valid options, prefer larger
+        i.width DESC,
+        -- Prefer PNG/GIF/ICO over SVG/WebP for simpler processing
+        CASE
+            WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
+            WHEN i.content_type IN ('image/webp') THEN 1
+            WHEN i.content_type IN ('image/svg+xml') THEN 2
+            ELSE 3
+        END,
+        -- Smaller file size as tiebreaker
+        i.file_size ASC
     LIMIT 1
 );
 ```
 
-**Stats emitted:** Hosts with icons, hosts without icons, icon size distribution.
+**Note on SVG/WebP:** These are downloaded and stored during scanning but are lower priority for bundle selection. Rasterizing SVG to PNG adds complexity; WebP re-encoding to PNG may increase size. If a host ONLY has SVG/WebP icons, we still use them (convert in bundle generation). But if PNG/GIF/ICO alternatives exist, prefer those.
 
-### Stage 6: Bundle Generation
+**Stats emitted:** Hosts with icons selected, hosts without any icon, icon size distribution, format distribution of selected icons.
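
To make the ordering in the Stage 4 query easier to reason about, the same four sort keys can be mirrored in a small Go comparator. This is only an illustrative sketch: the `Icon` struct and the `sizeRank`, `formatRank`, and `BestIcon` names are hypothetical, and it assumes width, height, and file size are non-NULL for completed rows.

```go
package main

import (
	"fmt"
	"sort"
)

// Icon holds the icons-table columns that the ordering depends on.
type Icon struct {
	S3Key       string
	Width       int
	Height      int
	ContentType string
	FileSize    int
}

// sizeRank mirrors the first CASE expression; lower is better.
func sizeRank(ic Icon) int {
	square := ic.Width == ic.Height
	switch {
	case square && (ic.Width == 64 || ic.Width == 48 || ic.Width == 32 || ic.Width == 16):
		return 0
	case square && ic.Width <= 64:
		return 1
	case ic.Width <= 64 && ic.Height <= 64:
		return 2
	default:
		return 3
	}
}

// formatRank mirrors the second CASE expression; lower is better.
func formatRank(ic Icon) int {
	switch ic.ContentType {
	case "image/png", "image/gif", "image/x-icon", "image/vnd.microsoft.icon":
		return 0
	case "image/webp":
		return 1
	case "image/svg+xml":
		return 2
	default:
		return 3
	}
}

// BestIcon returns the S3 key of the preferred icon, or "" if there is none.
func BestIcon(icons []Icon) string {
	if len(icons) == 0 {
		return ""
	}
	sorted := append([]Icon(nil), icons...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(a, b int) bool {
		x, y := sorted[a], sorted[b]
		if sizeRank(x) != sizeRank(y) {
			return sizeRank(x) < sizeRank(y)
		}
		if x.Width != y.Width {
			return x.Width > y.Width // larger first
		}
		if formatRank(x) != formatRank(y) {
			return formatRank(x) < formatRank(y)
		}
		return x.FileSize < y.FileSize // smaller file wins ties
	})
	return sorted[0].S3Key
}

func main() {
	icons := []Icon{
		{S3Key: "big", Width: 256, Height: 256, ContentType: "image/png", FileSize: 9000},
		{S3Key: "svg32", Width: 32, Height: 32, ContentType: "image/svg+xml", FileSize: 400},
		{S3Key: "png32", Width: 32, Height: 32, ContentType: "image/png", FileSize: 700},
		{S3Key: "png16", Width: 16, Height: 16, ContentType: "image/png", FileSize: 300},
	}
	fmt.Println(BestIcon(icons)) // prints "png32"
}
```

Because width is compared before format, a 32x32 PNG beats a 16x16 PNG, and format only breaks ties between icons of the same size class and width.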
 
-**Tool:** Custom Go program
+### Stage 5: Bundle Generation
 
-**Input:** `hosts` table (joined with their best icon from S3)
+**Tool:** Custom Go program (multi-threaded for image processing)
+
+**Input:** All hosts where `html_title IS NOT NULL` (include hosts without icons)
 
 **Process:**
-1. Query all hosts where best_icon_id IS NOT NULL (or include no-icon hosts with a default flag)
-2. Randomize the full result set (ORDER BY random() or shuffle in memory)
-3. For each host:
-   - Download its best icon from S3 `everytab-icons`
-   - Decode the icon (ICO/GIF/PNG/etc.)
-   - For ICO files: extract the largest embedded image at a standard size <= 64x64
-   - Re-encode as PNG (optimized compression)
+1. Query all qualifying hosts from RDS (with their best_icon_s3_key)
+2. Randomize the full result set
+3. For each host with an icon (best_icon_s3_key IS NOT NULL):
+   - Download from S3 `everytab-icons/{s3_key}`
+   - Decode the image (handle ICO, PNG, GIF, WebP, SVG):
+     - ICO: extract the largest embedded image at a standard size <= 64x64, decode to raster
+     - SVG: rasterize to 32x32 PNG
+     - WebP/GIF/BMP: decode to raster
+     - PNG: use as-is (re-compress if possible)
+   - Re-encode as optimized PNG (preserve original dimensions, don't upscale)
    - Base64-encode the PNG bytes
-4. Chunk into groups of N entries (~500-700, tuned so each JSON is ~1MB)
-5. Write each chunk as `tabs/{n}.json` to S3 `everytab-site`
-6. Record total bundle count
+4. For hosts without icons: set icon to empty string
+5. Chunk into groups of N entries (~100-150, tuned to fill a viewport)
+6. Serialize each chunk as JSON, write to S3 `everytab-site/tabs/{n}.json`
+7. Record total bundle count
 
 **Output:**
-- `tabs/0000.json` through `tabs/{M}.json` in S3
-- Total bundle count M (used in frontend build)
+- `tabs/0.json` through `tabs/{M}.json` in S3 `everytab-site`
+- Total bundle count M
+- `stats.json` in S3 `everytab-site` (pipeline statistics)
 
-**Stats emitted:** Total bundles created, total hosts included, total hosts excluded (no icon), average bundle size, total S3 storage used.
+**Stats emitted:** Total bundles created, total hosts included (with icon / without icon), average bundle size (bytes), total S3 storage used, icon conversion failures.
 
-### Stage 7: Frontend Build
+### Stage 6: Frontend Build
 
-**Tool:** Script/template that produces `index.html`
+**Tool:** Simple script or template engine
 
 **Process:**
-1. Inject `TOTAL_BUNDLES` constant into the JS (baked at build time)
-2. Minify if desired
-3. Upload `index.html` to S3 `everytab-site` root
+1. Inject `const TOTAL_BUNDLES = {M};` into the JS
+2. Write `index.html` and `site.js` to S3 `everytab-site`
+3. Invalidate CloudFront distribution (`/*`)
 
-### Stage 8: CloudFront Invalidation
+### Stage 7: Backup & Teardown
 
-Invalidate `/*` on the CloudFront distribution so the new site is live.
-
-### Stage 9: Backup & Teardown
-
-**Process:**
-1. Dump RDS database to local machine (homelab) — `pg_dump` over SSH tunnel or direct
-2. Sync S3 `everytab-icons` to homelab storage — `aws s3 sync`
-3. Confirm backups are complete
-4. Delete RDS instance
+**Process (manual, with confirmation):**
+1. Dump RDS database: `pg_dump` → transfer to homelab
+2. Sync icons: `aws s3 sync s3://everytab-icons/ homelab:/path/to/backup/icons/`
+3. **Confirm backups are complete and verified**
+4. Delete RDS instance (with final snapshot as safety net)
 5. Delete S3 `everytab-icons` bucket
 6. Terminate EC2 instance
 
+## DNS Architecture
+
+**Unbound** runs on the EC2 instance as the system DNS resolver.
+
+**Configuration:**
+- Recursive resolver mode (no forwarding to any upstream — resolves from root servers)
+- Listening on 127.0.0.1:53
+- Set as system resolver in `/etc/resolv.conf`
+- Aggressive caching enabled
+- High min-TTL (3600s) — maximizes cache hits for TLD/popular nameservers
+- High cache size (allocate 1-2GB RAM to Unbound)
+- Prefetch enabled (refresh popular entries before expiry)
+
+**Why recursive instead of forwarding:** Forwarding to Google/Cloudflare would get us rate-limited at 30M+ lookups. Recursive resolution distributes load across thousands of authoritative nameservers. With caching, the actual external query volume is much lower than 30M (most domains share TLD nameservers, many share CDN nameservers).
+
+**Transparent to Go:** The Go HTTP client uses the OS resolver, which uses Unbound. No custom transport, no SNI issues, no pre-resolved IPs needed. Standard HTTPS connections with normal hostname verification.
+
 ## Frontend Architecture
 
-### Single-File Design
+### File Structure
+- `index.html` — minimal HTML shell, inline CSS
+- `site.js` — tab rendering logic, bundle fetching, interaction (separate file for cleanliness, cached after first load)
 
-One `index.html` containing inline CSS and JS. No external dependencies, no framework. Two HTTP requests per initial page load:
-1. `GET /index.html` (HTML + CSS + JS, likely <50KB)
-2. `GET /tabs/{random}.json` (~1MB, one bundle of ~500-700 tabs)
+### Requests Per Visit
+1. `GET /index.html` — HTML + CSS (<10KB)
+2. `GET /site.js` — JavaScript (cached indefinitely via content hash in filename or cache headers)
+3. `GET /tabs/{random}.json` — first bundle (~150-300KB, Brotli-compressed to ~100-200KB)
+
+Subsequent scrolls: one additional `/tabs/{n}.json` per "page" of tabs.
 
 ### Tab Rendering
-- Tabs fill the viewport in rows, styled to mimic Firefox browser tabs (v1)
-- Each row has a slight horizontal marquee animation (CSS) at varying speeds
+- Rows of tabs fill the viewport, styled to mimic Firefox browser tabs (v1)
+- Each row has a subtle horizontal marquee animation (CSS `@keyframes` / `animation`) at slightly varying speeds
 - Tab density adapts to viewport width (responsive)
-- Each tab shows: favicon (or blank for no-icon) + truncated title
+- Each tab shows: favicon (rendered via `<img src="data:image/png;base64,...">`) + truncated title
+- No-icon tabs: just title text, no icon (Firefox behavior)
+- Enough tabs rendered to fill viewport + buffer below fold (so user can scroll immediately without waiting for next fetch)
 
 ### Interaction
 - **Click tab (iframe_ok=true):** Opens an iframe overlay showing the actual site
-- **Click tab (iframe_ok=false):** Opens site in a new tab (with external link indicator)
-- **Close:** X button or click-away dismisses the iframe/overlay
-- **Scroll down:** Triggers fetch of additional random bundles (infinite scroll)
+- **Click tab (iframe_ok=false):** Opens site in a new tab (with subtle external-link indicator on the tab)
+- **Close overlay:** X button or click outside dismisses iframe
+- **Scroll down:** When approaching the bottom, fetch next random bundle and render more rows
 
 ### Randomization
-- Seed: current UTC date (so everyone on the same day sees the same "shuffle", but it changes daily)
-- Generate random bundle index in range [0, TOTAL_BUNDLES)
-- Track fetched bundle IDs in a Set to avoid duplicates on scroll
+- Seed: `Date.now()` (milliseconds UTC) — every visitor at a different moment sees different tabs
+- PRNG: seeded random number generator (e.g., mulberry32 or xoshiro) for deterministic sequence from seed
+- Generate random bundle indices in range `[0, TOTAL_BUNDLES)`
+- Track fetched bundle IDs in a `Set` to avoid loading duplicates on continued scroll
 
-### No-Icon Hosts
+### Future Enhancements (v2+)
+- Browser-specific tab styles (Chrome tabs for Chrome users, Safari for Safari, etc.)
+- Mobile-optimized layout
+- "Search for a site" feature
+- Stats page (how many sites, coverage, etc.)
 
-Hosts without a favicon are included in bundles with `"icon": null`. Frontend renders these Firefox-style: just the title text with no icon. This matches Firefox's behavior for tabs without favicons.
+## Statistics & Metadata
+
+Each pipeline stage emits a JSON stats file:
+
+```
+stats/
+  01_cc_index.json
+  02_warc_parse.json
+  03_icon_download.json
+  04_best_icon.json
+  05_bundle_gen.json
+```
+
+After bundle generation, these are merged into a single `stats.json` uploaded to `everytab-site`:
+
+```json
+{
+  "crawl_id": "CC-MAIN-2026-05",
+  "generated_at": "2026-05-17T12:00:00Z",
+  "pipeline": {
+    "cc_index": {
+      "total_domains": 31245678,
+      "https": 28901234,
+      "http_only": 2344444,
+      "duplicates_removed": 1456789
+    },
+    "warc_parse": {
+      "processed": 31245678,
+      "titles_extracted": 29876543,
+      "icons_found": 45678901,
+      "iframe_restricted": 12345678,
+      "parse_failures": 234567
+    },
+    "icon_download": {
+      "attempted": 45678901,
+      "completed": 38901234,
+      "failed_dns": 2345678,
+      "failed_timeout": 1234567,
+      "failed_http_error": 1567890,
+      "failed_invalid_image": 890123,
+      "failed_too_large": 12345,
+      "unique_icons_stored": 34567890,
+      "dedup_hits": 4333344
+    },
+    "best_icon": {
+      "hosts_with_icon": 27654321,
+      "hosts_without_icon": 3591357
+    },
+    "bundles": {
+      "total_bundles": 52341,
+      "total_hosts_included": 29876543,
+      "hosts_with_icon": 27654321,
+      "hosts_without_icon": 2222222,
+      "excluded_no_title": 1369135,
+      "avg_bundle_size_bytes": 245000
+    }
+  }
+}
+```
+
+This is served publicly at `/stats.json` on the live site — interesting metadata for visitors and useful for monitoring pipeline health across crawls.
 
 ## Cost Estimate
 
@@ -312,49 +469,61 @@ Hosts without a favicon are included in bundles with `"icon": null`. Frontend re
 | Item | Estimate |
 |------|----------|
 | EC2 c5.xlarge (~24-48hrs) | $8-16 |
-| RDS db.t3.medium (~48hrs) | $3-5 |
-| S3 icons storage (temporary, ~500GB) | $12 (prorated to days) |
-| S3 GET requests (30M WARC reads) | $12 |
-| Data transfer (icon downloads, ~500GB inbound) | $0 (inbound is free) |
-| **Total** | **~$35-45** |
+| RDS db.t3.medium (~48-72hrs including dev time) | $3-7 |
+| S3 everytab-icons storage (~500GB, prorated to days) | $1-3 |
+| S3 PUT requests (icon uploads, ~30M) | $15 |
+| Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) |
+| Data transfer (icon downloads from internet, inbound) | $0 (inbound free) |
+| Data transfer (backup to homelab, outbound) | $5-10 |
+| **Total** | **~$32-51** |
 
 ### Hosting Phase (Monthly Steady-State)
 
 | Item | Estimate |
 |------|----------|
-| S3 storage (~60GB bundles) | $1.40 |
-| CloudFront (free tier: 1TB/month, 10M requests) | $0* |
-| S3 requests (via CloudFront origin pulls, cached) | ~$1-5 |
-| **Total** | **~$3-10/month** |
+| S3 everytab-site storage (~10-15GB of bundles) | $0.35 |
+| CloudFront (free tier: 1TB/month transfer, 10M requests/month) | $0 |
+| S3 origin requests via CloudFront (heavily cached) | $1-3 |
+| **Total** | **~$2-4/month** |
 
-*CloudFront free tier covers moderate traffic. Costs increase if the site goes viral, but that's a good problem to have.
+Note: Bundle storage estimate revised down. With ~50K bundles at ~250KB each = ~12.5GB, well under previous estimate since we're targeting viewport-fill (100-150 tabs) not 1MB bundles.
+
+If the site gets significant traffic beyond CloudFront free tier, costs scale with usage — but that's a success problem.
 
 ## Scaling Strategy
 
-### Development (100K domains)
-- Single EC2 instance
-- All stages complete in minutes-to-hours
-- Good for validating the full pipeline end-to-end
+### Development Phase (100K domains)
+- Cap CC-Index query to 100K rows
+- Full pipeline runs in minutes
+- Validates end-to-end correctness
+- Frontend development and tab-density tuning
 
 ### Full Scan (30M domains)
 - Single EC2 instance, high concurrency
-- CC-Index query: <1hr
-- WARC parsing: 2-6hrs (limited by S3 request rate)
-- Icon download: 12-48hrs (limited by network + remote server response times)
+- CC-Index query: <1hr (httpfs) or ~2hrs (download + local query)
+- WARC parsing: 2-6hrs
+- Icon download: 12-48hrs (the long pole)
 - Bundle generation: 1-2hrs
+- Total: ~1-2 days
 
-### Fleet Scaling (if needed)
+### Fleet Scaling (if single instance is too slow)
 - Spin up N identical EC2 instances running the icon downloader
-- All share the same RDS instance
-- Work claiming via Postgres atomic UPDATEs (no coordinator needed)
-- Linear scaling: 4 instances = ~4x throughput
+- All connect to the same RDS instance
+- Work claiming via `FOR UPDATE SKIP LOCKED` — no double work, no coordinator
+- Linear throughput scaling: 4 instances ≈ 4x download speed
+- Only the icon download stage benefits from fleet (other stages are fast enough solo)
 
 ## Key Design Decisions
 
-1. **Static-only hosting** — No servers running for the live site. Entire frontend is pre-built.
-2. **Inline icons in bundles** — No per-icon requests. One bundle fetch gives you ~600 tabs to render.
-3. **Unbound as system resolver** — Transparent to application code. Go HTTP client works normally; DNS just happens to resolve locally.
-4. **Content-addressed icon storage** — S3 key is the content hash. Natural dedup at storage layer during scanning (but icons are duplicated across bundles for simplicity).
-5. **Resumable pipeline** — Each stage uses database state (parsed, scan_state) to track progress. Crash and restart without re-doing completed work.
-6. **PNG as universal icon format** — All icons converted to PNG for bundles regardless of source format. Smallest file size for small raster images, universally supported in browsers via data URIs.
-7. **Date-seeded randomization** — Everyone visiting on the same day sees the same tab arrangement, creating a shared experience. Changes daily for freshness.
+1. **Static-only hosting** — No servers for the live site. Everything pre-built. Minimal attack surface, minimal cost.
+2. **Inline icons in bundles** — One fetch gives you 100+ tabs to render. No per-icon requests.
+3. **Base64 + Brotli** — Base64 for browser-native decoding (`atob()`). Brotli compression at the CDN layer reduces transfer size by ~25-30% for free.
+4. **Unbound as system resolver** — Transparent to application code. Standard Go HTTP. No custom networking.
+5. **SHA-256 content-addressed icon storage** — Natural dedup at S3 layer. Same favicon stored once even if referenced by multiple hosts.
+6. **Permissive download, selective bundling** — Download ALL favicon formats during scanning. Convert to optimized PNG only during bundle generation. Decouples "capture as much as possible" from "serve the best version."
+7. **Partial index for work claiming** — Indexes only unscanned rows. Shrinks as work progresses. Minimal write amplification.
+8. **Two S3 buckets** — Clean separation of concerns. Private working storage vs public site. Safe deletion of temporary data.
+9. **Per-millisecond random seed** — Every visitor sees a unique arrangement. No shared state, no server needed for randomization.
+10. **Viewport-sized bundles** — ~100-150 tabs per bundle, tuned to fill a screen. Faster loads, smaller memory footprint than 1MB bundles.
+11. **Include no-icon hosts** — Any host with a title is included. Firefox-style rendering (title only) for hosts without favicons.
+12. **Denormalized best_icon_s3_key in hosts** — Avoids joins during bundle generation. Written once during icon selection, read once during bundling.
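
The content-addressed storage decision (SHA-256 key, skip the upload when the key already exists) can be illustrated with a short Go sketch. Everything here is hypothetical scaffolding: `IconKey`, `Store`, and `Put` are illustrative names, and the in-memory map merely stands in for the `everytab-icons` bucket. A real downloader would perform an existence check against S3 instead.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// IconKey returns the hex-encoded SHA-256 of the icon bytes, used as the key.
func IconKey(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// Store is an in-memory stand-in for the icons bucket (an assumption made
// for this sketch so the dedup logic can run without S3).
type Store struct {
	objects map[string][]byte
}

// NewStore creates an empty Store.
func NewStore() *Store {
	return &Store{objects: make(map[string][]byte)}
}

// Put stores content under its hash. It returns the key and whether an
// upload actually happened; identical bytes are stored only once.
func (s *Store) Put(content []byte) (key string, uploaded bool) {
	key = IconKey(content)
	if _, exists := s.objects[key]; exists {
		return key, false // dedup hit: caller still records the key per icon row
	}
	s.objects[key] = append([]byte(nil), content...)
	return key, true
}

func main() {
	store := NewStore()
	favicon := []byte("identical favicon bytes served by two sites")

	k1, up1 := store.Put(favicon)
	k2, up2 := store.Put(favicon)
	fmt.Println(k1 == k2, up1, up2, len(store.objects))
}
```

Both calls return the same key, so two `icons` rows can point at one stored object, which is what the `dedup_hits` statistic would count.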