EveryTab Architecture

System Overview

EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:

  1. Scanning Phase — A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
  2. Hosting Phase — A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.

The scanning phase runs monthly, triggered by each new Common Crawl release. It produces a static site; its data is then backed up to the homelab and its infrastructure is torn down. The hosting phase runs indefinitely at minimal cost.

Workflow Diagram

flowchart TD
    subgraph EC2["Scanning Phase (EC2 instance)"]
        A["Stage 1: Query CC-Index via DuckDB"]
        B["Stage 2: Parse WARCs - Go"]
        C["Stage 3: Download Icons - Go"]
        D["Stage 4: Select Best Icons"]
        E["Stage 5: Generate Bundles - Go"]
        F["Stage 6: Build Frontend"]
        UB["Unbound - Local recursive resolver"]

        A --> B --> C --> D --> E --> F
        UB -.-> C
    end

    subgraph ExtData["External Data"]
        CC["Common Crawl S3 - Parquet Index + WARCs"]
    end

    subgraph AWS["AWS Services"]
        RDS[("RDS Postgres - hosts + icons tables")]
        S3I["S3: everytab-icons - Raw downloaded favicons"]
        S3S["S3: everytab-site - tabs/*.json + index.html"]
        CF["CloudFront CDN"]
    end

    subgraph Post["Post-Scan"]
        BAK["Backup to Homelab - RDS dump + icons sync"]
        TEAR["Teardown - Delete RDS, icons bucket, EC2"]
    end

    CC --> A
    CC --> B
    A --> RDS
    B --> RDS
    C --> S3I
    C --> RDS
    D --> RDS
    E --> S3S
    F --> S3S
    S3S --> CF

    F --> BAK
    BAK --> TEAR

Key point: DuckDB, Go programs, and Unbound all run on the same EC2 instance. The pipeline is sequential — one stage completes before the next begins.

AWS Infrastructure

All resources in us-east-1.

| Resource | Purpose | Lifecycle |
|---|---|---|
| EC2 (c5.xlarge) | Run all pipeline stages | Scanning only |
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup to homelab, then delete) |
| S3 everytab-icons | Raw downloaded favicons | Scanning only (backup to homelab, then delete) |
| S3 everytab-site | Static site: index.html, site.js, tabs/*.json | Permanent |
| CloudFront | CDN for static site (Brotli compression enabled) | Permanent |
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only (runs on EC2) |

Why Two S3 Buckets

  • everytab-site is the CloudFront origin. The bucket itself stays private; CloudFront reads it through Origin Access Control (OAC), so its content is publicly reachable only via the CDN. The entire bucket IS the website.
  • everytab-icons is completely private — only the EC2 instance reads/writes to it. No public access configuration needed.
  • Backup is clean: aws s3 sync s3://everytab-icons/ /homelab/path/ grabs the whole bucket.
  • Deletion is clean: aws s3 rb s3://everytab-icons --force — zero risk of nuking the live site.
  • One bucket with prefix-based policies works but is fiddlier (CloudFront must serve tabs/ and index.html but NOT icons/). Two buckets eliminate that surface area for misconfiguration.

Steady-State (Hosting Only)

  • S3 everytab-site — index.html + site.js + ~50K JSON bundles
  • CloudFront distribution — Brotli-compressed delivery, caching

Data Model

hosts table

| Column | Type | Description |
|---|---|---|
| id | SERIAL PRIMARY KEY | Internal ID |
| hostname | TEXT NOT NULL UNIQUE | e.g., example.com |
| protocol | TEXT NOT NULL | https or http (prefer https) |
| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., CC-MAIN-2026-05) |
| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
| warc_record_length | INT NOT NULL | Length of WARC record |
| html_title | TEXT | Extracted from <title> tag |
| iframe_allowed | BOOLEAN | True if site allows framing |
| best_icon_s3_key | TEXT | S3 key of the chosen icon (denormalized for fast bundle gen) |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |

icons table

| Column | Type | Description |
|---|---|---|
| id | SERIAL PRIMARY KEY | Internal ID |
| host_id | INT REFERENCES hosts(id) | FK to parent host |
| url | TEXT NOT NULL | Full URL to the icon |
| source | TEXT NOT NULL | favicon_ico or link_rel |
| rel_type | TEXT | MIME type from HTML attribute (if specified) |
| rel_sizes | TEXT | Sizes attribute from HTML (if specified) |
| content_type | TEXT | Actual MIME type after download |
| width | INT | Best usable pixel width (for ICO: largest standard size ≤64; for SVG: NULL) |
| height | INT | Best usable pixel height (for ICO: largest standard size ≤64; for SVG: NULL) |
| file_size | INT | Size in bytes |
| s3_key | TEXT | Key in everytab-icons bucket (SHA-256 of content) |
| scan_state | TEXT DEFAULT 'unscanned' | unscanned, in_progress, completed, failed |
| error | TEXT | Error message if failed |

Indexes:

  • CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned' — partial index for work claiming. It covers only unscanned rows, so it shrinks as work completes. Write overhead is small: a row enters the index when it is inserted as 'unscanned', and later state changes (in_progress to completed or failed) add no new entries.
  • idx_icons_host_id on (host_id) — for best-icon selection query

S3 Key Strategy: SHA-256 hash of the downloaded icon content. This gives free dedup at the storage layer — if two sites serve the exact same favicon bytes, we store it once. The hash is computed client-side (by the Go downloader) and used as the key. Before uploading, check if the key exists; if so, skip the upload but still record the s3_key in the icons table.
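
As a minimal sketch of the content-addressed key described above: the function name contentKey and the choice of lowercase hex encoding are assumptions for illustration, since the document only specifies "SHA-256 of content".

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentKey returns the hex-encoded SHA-256 digest of the icon bytes.
// Identical bytes always map to the same key, which is what provides dedup.
// (Hex encoding is an assumption; the document does not name an encoding.)
func contentKey(icon []byte) string {
	sum := sha256.Sum256(icon)
	return hex.EncodeToString(sum[:])
}

func main() {
	a := contentKey([]byte("favicon bytes"))
	b := contentKey([]byte("favicon bytes"))
	fmt.Println(a == b, len(a)) // prints: true 64
}
```

Because the key depends only on the bytes, the "does this key exist?" check before upload can be a single S3 HEAD request against everytab-icons/{key}.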

Bundle JSON format (tabs/{n}.json)

{
  "entries": [
    {
      "host": "example.com",
      "title": "Example Domain",
      "icon": "iVBORw0KGgo...",
      "icon_w": 32,
      "icon_h": 32,
      "iframe_ok": true
    },
    {
      "host": "no-favicon-site.org",
      "title": "A Site Without Favicon",
      "icon": "",
      "iframe_ok": false
    }
  ]
}

Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with "icon": "") as long as they have a title. CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data.

Bundle size is parameterized (ENTRIES_PER_BUNDLE). Target: enough entries to fill a viewport plus scroll buffer. Initial estimate ~100-150 entries (~150-300KB uncompressed, smaller after Brotli). Will be tuned empirically once the frontend is built and we can measure how many tabs fill a screen.
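
The bundle format above could map onto Go types roughly as follows. This is a sketch: the type names Entry and Bundle and the helper encodeBundle are hypothetical, and omitempty on icon_w/icon_h is an assumption made to match the no-icon example, which omits both fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entry mirrors one element of "entries" in tabs/{n}.json.
type Entry struct {
	Host     string `json:"host"`
	Title    string `json:"title"`
	Icon     string `json:"icon"`             // base64 PNG, "" when the host has no icon
	IconW    int    `json:"icon_w,omitempty"` // left out for no-icon entries
	IconH    int    `json:"icon_h,omitempty"`
	IframeOK bool   `json:"iframe_ok"`
}

// Bundle is the top-level object of one bundle file.
type Bundle struct {
	Entries []Entry `json:"entries"`
}

// encodeBundle serializes a bundle to compact JSON.
func encodeBundle(b Bundle) (string, error) {
	out, err := json.Marshal(b)
	return string(out), err
}

func main() {
	s, err := encodeBundle(Bundle{Entries: []Entry{
		{Host: "example.com", Title: "Example Domain", Icon: "iVBORw0KGgo...", IconW: 32, IconH: 32, IframeOK: true},
		{Host: "no-favicon-site.org", Title: "A Site Without Favicon"},
	}})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```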

Pipeline Stages

The pipeline is a series of manually run scripts executed in order on the single EC2 instance. Each stage is idempotent and resumable.

Stage 1: CC-Index Query

Tool: DuckDB with httpfs extension (query CC parquet directly from S3; if the query takes more than 1hr, fall back to downloading the parquet files locally first)

Input: Common Crawl columnar index (parquet files on s3://commoncrawl/cc-index/...)

Query logic:

WHERE url_path = '/'
  AND content_mime_type = 'text/html'
  AND fetch_status = 200
  AND url_query IS NULL
  AND url_protocol IN ('http', 'https')
  AND url_port IS NULL

Deduplication: Per hostname, prefer https over http. Result is one row per unique hostname.
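
The deduplication rule can be illustrated in a few lines. In practice it would more likely be expressed in the DuckDB query itself; the sketch below only demonstrates the rule, and the Row type and dedupeHosts function are hypothetical names.

```go
package main

import "fmt"

// Row is a hypothetical, trimmed-down CC-Index result row.
type Row struct {
	Hostname string
	Protocol string // "http" or "https"
}

// dedupeHosts keeps one row per hostname. An http row is replaced when an
// https row for the same hostname shows up; otherwise the first row wins.
func dedupeHosts(rows []Row) map[string]Row {
	best := make(map[string]Row, len(rows))
	for _, r := range rows {
		cur, seen := best[r.Hostname]
		if !seen || (cur.Protocol == "http" && r.Protocol == "https") {
			best[r.Hostname] = r
		}
	}
	return best
}

func main() {
	rows := []Row{
		{"example.com", "http"},
		{"example.com", "https"},
		{"plain.example", "http"},
	}
	for host, r := range dedupeHosts(rows) {
		fmt.Println(host, r.Protocol)
	}
}
```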

Output: Populates hosts table in RDS (~30M rows for a full crawl).

Cost: $0 — Common Crawl is part of the AWS Open Data Registry. S3 GET requests and data transfer within us-east-1 are free.

Stats emitted: Total domains found, https vs http breakdown, duplicates removed.

Stage 2: WARC Parsing

Tool: Custom Go program, highly concurrent

Input: hosts table rows where parsed = FALSE

Process:

  1. Read batches of unparsed rows (cursor-based pagination by ID)
  2. For each row, make a byte-range GET request to Common Crawl's S3:
    • Range: bytes={offset}-{offset+length-1}
    • Target: https://data.commoncrawl.org/{warc_filename}
  3. Parse the WARC record to extract the HTTP response
  4. From HTTP response headers: check for X-Frame-Options and Content-Security-Policy frame-ancestors
  5. Parse HTML defensively (lenient parser, handle malformed HTML):
    • Extract <title> tag content
    • Extract ALL <link rel="icon"> / <link rel="shortcut icon"> entries with their href, type, and sizes attributes
  6. Insert a /favicon.ico entry into icons for every host (protocol://hostname/favicon.ico)
  7. Insert all discovered link rel="icon" entries into icons (any format: ICO, PNG, GIF, SVG, WebP, JPEG)
  8. Update hosts row: html_title, iframe_allowed, parsed = TRUE
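
Two of the steps above lend themselves to a short sketch: building the byte-range header (step 2) and deriving iframe_allowed from response headers (step 4). The function names are hypothetical, and the framing rule is a deliberately conservative assumption: any X-Frame-Options header, or a frame-ancestors directive other than a bare *, is treated as "framing not allowed".

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// rangeHeader builds the Range value for one WARC record.
// HTTP byte ranges are inclusive on both ends, hence the -1.
func rangeHeader(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// iframeAllowed reports whether the archived response permits framing by a
// third-party site, under the conservative assumption described above.
func iframeAllowed(h http.Header) bool {
	if h.Get("X-Frame-Options") != "" {
		return false // DENY and SAMEORIGIN both block us
	}
	for _, policy := range h.Values("Content-Security-Policy") {
		for _, directive := range strings.Split(policy, ";") {
			fields := strings.Fields(strings.ToLower(directive))
			if len(fields) > 0 && fields[0] == "frame-ancestors" {
				return len(fields) == 2 && fields[1] == "*"
			}
		}
	}
	return true
}

func main() {
	h := http.Header{}
	h.Set("X-Frame-Options", "SAMEORIGIN")
	fmt.Println(rangeHeader(1024, 512), iframeAllowed(h)) // prints: bytes=1024-1535 false
}
```

The headers recorded in the WARC reflect the site at crawl time, so iframe_allowed is a best guess; the frontend still needs to cope with a live site that refuses to be framed.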

Concurrency: High — thousands of goroutines with a semaphore/pool. CC's S3 handles massive throughput.

Error handling: Malformed HTML → still extract what we can (partial title, partial icons). WARC fetch failure → log and skip (mark parsed = TRUE with NULL title to avoid retry loops). All errors logged with hostname for investigation.

Icon URL handling: Relative URLs resolved against {protocol}://{hostname}/. Absolute URLs kept as-is. Data URIs ignored.
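
The resolution rule above maps closely onto the standard net/url package. A minimal sketch, with resolveIconURL as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveIconURL resolves href against {protocol}://{hostname}/.
// It returns ok=false for empty hrefs, data: URIs, and unparsable hrefs.
func resolveIconURL(protocol, hostname, href string) (string, bool) {
	href = strings.TrimSpace(href)
	if href == "" || strings.HasPrefix(strings.ToLower(href), "data:") {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	base := &url.URL{Scheme: protocol, Host: hostname, Path: "/"}
	return base.ResolveReference(ref).String(), true
}

func main() {
	for _, href := range []string{"/static/fav.png", "img/icon.ico", "//cdn.example.net/i.png", "data:image/png;base64,AAAA"} {
		u, ok := resolveIconURL("https", "example.com", href)
		fmt.Println(ok, u)
	}
}
```

Protocol-relative hrefs (starting with //) inherit the host's protocol, which falls out of ResolveReference without special handling.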

No scan_state needed: CC's S3 is highly reliable. The parsed boolean is sufficient. If the process crashes mid-batch, re-run picks up where it left off (unparsed rows).

Cost: $0 (same Open Data program).

Stats emitted: Rows processed, titles extracted, icons found (by source: favicon_ico vs link_rel), icon format distribution, iframe restrictions found, parse failures, rows with no title.

Stage 3: Icon Download

Tool: Custom Go program, highly concurrent

Prerequisite: Unbound running as system resolver on the EC2 instance.

Input: icons table rows where scan_state = 'unscanned' and icon is worth downloading:

  • All favicon_ico entries (always attempt)
  • link_rel entries with no declared size (unknown, could be useful)
  • link_rel entries with declared size ≤64x64
  • Skip link_rel entries with declared size >64x64 (192x192, 180x180, 152x152, etc. — apple-touch-icon bloat we won't use at tab scale)
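
The filter above might look like the following sketch. The function names are hypothetical, and two details are assumptions not stated in the list: a sizes attribute with several values qualifies if any one of them fits, and an unparsable value (such as "any") is treated like an undeclared size.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSize parses one sizes token such as "32x32".
func parseSize(tok string) (int, int, bool) {
	ws, hs, found := strings.Cut(tok, "x")
	if !found {
		return 0, 0, false
	}
	w, errW := strconv.Atoi(ws)
	h, errH := strconv.Atoi(hs)
	if errW != nil || errH != nil {
		return 0, 0, false
	}
	return w, h, true
}

// worthDownloading applies the download filter to one icons row.
func worthDownloading(source, relSizes string) bool {
	if source == "favicon_ico" {
		return true // always attempt /favicon.ico
	}
	tokens := strings.Fields(strings.ToLower(relSizes))
	if len(tokens) == 0 {
		return true // no declared size: unknown, could be useful
	}
	for _, tok := range tokens {
		w, h, ok := parseSize(tok)
		if !ok || (w <= 64 && h <= 64) {
			return true
		}
	}
	return false // every declared size is larger than 64x64
}

func main() {
	fmt.Println(worthDownloading("link_rel", "32x32"))   // true
	fmt.Println(worthDownloading("link_rel", "192x192")) // false
}
```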

Process:

  1. Claim batch (randomized to spread load across hosts):
    UPDATE icons SET scan_state = 'in_progress'
    WHERE id IN (
      SELECT id FROM icons
      WHERE scan_state = 'unscanned'
      ORDER BY md5(id::text)  -- deterministic shuffle: spreads hosts apart
      LIMIT N
      FOR UPDATE SKIP LOCKED
    ) RETURNING *;
    
    Ordering by a hash of the ID interleaves hosts, so requests to the same domain are unlikely to be back-to-back. With 30M+ icons spread across different hosts, a shuffled batch of 1000 rarely contains two icons from the same server.
  2. For each icon URL:
    • Make HTTP(S) GET request (standard Go HTTP client — DNS transparently goes through Unbound)
    • Enforce timeouts: 5s connect, 10s total
    • Enforce max download size: 512KB (generous for icons, but prevents abuse)
    • On success:
      • Validate magic bytes (is this actually an image?)
      • Decode to get dimensions:
        • PNG/GIF/WebP/JPEG/BMP: read image headers for width/height
        • ICO: parse ICO header, find largest embedded size ≤64x64 at a standard dimension (16/32/48/64), store THAT size in width/height
        • SVG: store width=NULL, height=NULL (vector, no pixel size)
      • Compute SHA-256 of content
      • Upload to S3 everytab-icons/{sha256} (skip if key already exists — dedup)
      • Update icons row: s3_key, content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
    • On failure: scan_state = 'failed', error = reason
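
The ICO rule in the list above (largest embedded image at a standard dimension ≤64) only needs the ICO directory, not the image data. The sketch below assumes the standard ICO layout: a 6-byte header (reserved, type = 1, count) followed by 16-byte entries whose first two bytes are width and height, with 0 meaning 256. The names bestICOSize and dirEntry are hypothetical, and requiring square entries is an added assumption.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// bestICOSize scans the ICO directory and returns the largest square entry
// whose side is 16, 32, 48 or 64. ok is false if no such entry exists or the
// data does not start with an ICO header.
func bestICOSize(data []byte) (size int, ok bool) {
	if len(data) < 6 ||
		binary.LittleEndian.Uint16(data[0:2]) != 0 || // reserved, must be 0
		binary.LittleEndian.Uint16(data[2:4]) != 1 { // 1 = icon
		return 0, false
	}
	count := int(binary.LittleEndian.Uint16(data[4:6]))
	for i := 0; i < count; i++ {
		off := 6 + 16*i
		if off+16 > len(data) {
			break // truncated directory
		}
		w, h := int(data[off]), int(data[off+1]) // a stored 0 means 256
		if w != h {
			continue
		}
		switch w {
		case 16, 32, 48, 64:
			if w > size {
				size = w
			}
		}
	}
	return size, size > 0
}

// dirEntry builds a 16-byte directory entry with only width/height set.
func dirEntry(w, h byte) []byte {
	e := make([]byte, 16)
	e[0], e[1] = w, h
	return e
}

func main() {
	ico := []byte{0, 0, 1, 0, 3, 0} // reserved=0, type=1, count=3
	ico = append(ico, dirEntry(16, 16)...)
	ico = append(ico, dirEntry(48, 48)...)
	ico = append(ico, dirEntry(0, 0)...) // 256x256, too large for a tab
	fmt.Println(bestICOSize(ico))       // prints: 48 true
}
```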

Concurrency: Goroutine pool with configurable size (start 1000, tune based on system resources). Semaphore pattern for backpressure. Monitor memory usage.
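
A minimal sketch of the semaphore pattern mentioned above, using a buffered channel as the semaphore. The runPool name and its signature are hypothetical; in the real downloader the work function would perform the HTTP fetch and database update.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runPool calls work(job) for every job with at most limit calls in flight.
// Acquiring a slot blocks once limit is reached, which is the backpressure:
// the loop cannot spawn goroutines faster than they finish.
func runPool(jobs []string, limit int, work func(string)) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, job := range jobs {
		sem <- struct{}{} // acquire; blocks while the pool is full
		wg.Add(1)
		go func(j string) {
			defer wg.Done()
			defer func() { <-sem }() // release
			work(j)
		}(job)
	}
	wg.Wait()
}

func main() {
	var done int64
	urls := []string{"https://a.example/favicon.ico", "https://b.example/favicon.ico", "https://c.example/favicon.ico"}
	runPool(urls, 2, func(string) { atomic.AddInt64(&done, 1) })
	fmt.Println(atomic.LoadInt64(&done)) // prints: 3
}
```

Blocking in the producer loop, rather than spawning one goroutine per row up front, keeps memory bounded by the pool size instead of the batch size.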

Fast failure strategy:

  • DNS failure → fail immediately (Unbound will cache NXDOMAIN)
  • Connection refused → fail immediately
  • Timeout → fail after deadline (no retry)
  • Too large → abort read at 512KB boundary
  • Not an image → fail (record content-type in error)
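
The last two rules can be sketched with the standard library alone: io.LimitReader enforces the size cap and http.DetectContentType inspects magic bytes. The names readIcon, classify, errTooLarge and errNotImage are hypothetical. One caveat to this approach: DetectContentType does not recognize SVG, so SVG detection would need a separate check that is not shown here.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

const maxIconBytes = 512 * 1024

var (
	errTooLarge = errors.New("too large")
	errNotImage = errors.New("not an image")
)

// readIcon reads at most maxIconBytes from r and checks the magic bytes.
// It returns the sniffed content type even when the body is not an image,
// so the caller can record it in the error column.
func readIcon(r io.Reader) ([]byte, string, error) {
	// Read one byte past the cap so an oversized body is detectable.
	data, err := io.ReadAll(io.LimitReader(r, maxIconBytes+1))
	if err != nil {
		return nil, "", err
	}
	if len(data) > maxIconBytes {
		return nil, "", errTooLarge
	}
	ctype := http.DetectContentType(data) // looks at the first 512 bytes only
	if !strings.HasPrefix(ctype, "image/") {
		return nil, ctype, errNotImage
	}
	return data, ctype, nil
}

// classify is a convenience wrapper for in-memory bodies.
func classify(body []byte) (string, error) {
	_, ctype, err := readIcon(bytes.NewReader(body))
	return ctype, err
}

func main() {
	fmt.Println(classify([]byte("\x89PNG\r\n\x1a\n...")))
	fmt.Println(classify([]byte("<html><body>404</body></html>")))
}
```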

Permissive on format: Download everything, whether ICO, PNG, GIF, SVG, WebP, JPEG, BMP, or whatever else the server returns. Store the raw bytes in S3. Format filtering and conversion happen later in bundle generation.

Scaling to fleet (if needed):

  • Multiple EC2 instances run the same binary
  • Each claims work via Postgres row-level locking (FOR UPDATE SKIP LOCKED)
  • No coordinator needed — linear scaling with instance count

Stats emitted: Icons attempted, completed, failed (breakdown by error type: DNS, timeout, connection refused, HTTP 4xx, HTTP 5xx, invalid image, too large), icons/sec rate, bytes downloaded, unique S3 keys (dedup hits).

Stage 4: Best Icon Selection

Tool: SQL script

Process: For each host, select the best icon from all its completed downloads.

Selection priority (decision flow):

  1. Standard square sizes (32x32, 64x64, 48x48, 16x16) — ideal for tab display. Prefer larger.
  2. Other square sizes ≤64px — close enough. Prefer larger.
  3. Non-square but both dimensions ≤64px — acceptable. Prefer larger.
  4. Everything else (180x180, 192x192, SVG with no dimensions, etc.) — last resort, will be downscaled in bundle generation.

Within the same tier: prefer PNG/GIF/ICO over WebP over SVG, then smaller file size as tiebreaker.

Does not distinguish between favicon_ico and link_rel sources — purely based on what was actually downloaded and its dimensions/format.
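
The selection rules can be written as a comparator. The actual implementation is the SQL script referenced below; this Go sketch only restates the rules, and several points are assumptions: 0 stands in for NULL dimensions, "prefer larger" is compared by pixel area, the order of tie-breaks is tier, then size, then format, then file size, and formats the text does not rank (JPEG, BMP) are placed alongside WebP.

```go
package main

import "fmt"

// Icon is a hypothetical view of one completed icons row.
type Icon struct {
	W, H     int    // 0 stands in for NULL (e.g. SVG)
	Format   string // "png", "gif", "ico", "webp", "svg", ...
	FileSize int
}

// tier returns 1 (best) through 4 (last resort).
func tier(w, h int) int {
	known := w > 0 && h > 0
	square := known && w == h
	switch {
	case square && (w == 16 || w == 32 || w == 48 || w == 64):
		return 1
	case square && w <= 64:
		return 2
	case known && w <= 64 && h <= 64:
		return 3
	default:
		return 4
	}
}

// formatRank orders formats within a tier; lower is better.
func formatRank(f string) int {
	switch f {
	case "png", "gif", "ico":
		return 0
	case "svg":
		return 2
	default:
		return 1 // webp, plus unranked formats (assumption)
	}
}

// better reports whether a should be chosen over b.
func better(a, b Icon) bool {
	if ta, tb := tier(a.W, a.H), tier(b.W, b.H); ta != tb {
		return ta < tb
	}
	if a.W*a.H != b.W*b.H {
		return a.W*a.H > b.W*b.H
	}
	if ra, rb := formatRank(a.Format), formatRank(b.Format); ra != rb {
		return ra < rb
	}
	return a.FileSize < b.FileSize
}

func main() {
	fmt.Println(tier(32, 32), tier(24, 24), tier(32, 16), tier(192, 192)) // prints: 1 2 3 4
	fmt.Println(better(Icon{32, 32, "png", 900}, Icon{16, 16, "png", 300})) // prints: true
}
```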

Uses DISTINCT ON (host_id) for efficient single-pass selection. See pipeline/04_best_icon/select.sql.

Note on SVG/WebP: Lower priority because rasterizing SVG adds complexity and WebP-to-PNG re-encoding may increase size. Only selected when no raster alternatives exist.

Stats emitted: Hosts with icons selected, hosts without any icon.

Stage 5: Bundle Generation

Tool: Custom Go program (multi-threaded for image processing)

Input: All hosts where html_title IS NOT NULL (include hosts without icons)

Process:

  1. Query all qualifying hosts from RDS (with their best_icon_s3_key)
  2. Randomize the full result set
  3. For each host with an icon (best_icon_s3_key IS NOT NULL):
    • Download from S3 everytab-icons/{s3_key}
    • Decode the image based on format:
      • ICO: parse container, extract the image at the size recorded in width/height (the largest standard size ≤64x64). ICO can embed BMP or PNG internally — decode whichever is present.
      • PNG: decode directly
      • GIF/WebP/BMP/JPEG: decode to raster
      • SVG: rasterize to 32x32 (use a Go SVG rasterizer library)
    • Re-encode as optimized PNG at original dimensions (never upscale — a 16x16 stays 16x16)
    • Base64-encode the PNG bytes
  4. For hosts without icons: set icon to empty string
  5. Chunk into groups of ENTRIES_PER_BUNDLE entries (parameterized, initially ~100-150, tuned to viewport fill)
  6. Serialize each chunk as JSON, write to S3 everytab-site/tabs/{n}.json
  7. Record total bundle count
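
Steps 2 and 5 (randomize, then chunk) can be sketched as follows. The chunk helper is a hypothetical name, and plain strings stand in for full host entries.

```go
package main

import (
	"fmt"
	"math/rand"
)

// chunk splits hosts into consecutive groups of at most size entries.
// The last group may be shorter. A non-positive size yields no groups.
func chunk(hosts []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for start := 0; start < len(hosts); start += size {
		end := start + size
		if end > len(hosts) {
			end = len(hosts)
		}
		out = append(out, hosts[start:end])
	}
	return out
}

func main() {
	hosts := []string{"a.example", "b.example", "c.example", "d.example", "e.example"}
	// Randomize the full result set before chunking (step 2).
	rand.Shuffle(len(hosts), func(i, j int) { hosts[i], hosts[j] = hosts[j], hosts[i] })
	for n, c := range chunk(hosts, 2) {
		fmt.Printf("tabs/%d.json: %v\n", n, c)
	}
}
```

With a bundle count of M, the files written are numbered 0 through M-1, which matches the frontend's index range [0, TOTAL_BUNDLES).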

Output:

  • tabs/0.json through tabs/{M-1}.json in S3 everytab-site
  • Total bundle count M
  • stats.json in S3 everytab-site (pipeline statistics)

Stats emitted: Total bundles created, total hosts included (with icon / without icon), average bundle size (bytes), total S3 storage used, icon conversion failures.

Stage 6: Frontend Build

Tool: Simple script or template engine

Process:

  1. Inject const TOTAL_BUNDLES = {M}; into the JS
  2. Write index.html and site.js to S3 everytab-site
  3. Invalidate CloudFront distribution (/*)

Stage 7: Backup & Teardown

Process (manual, with confirmation at each step):

  1. Dump RDS database: pg_dump → transfer to homelab
  2. Sync icons: aws s3 sync s3://everytab-icons/ homelab:/path/to/backup/icons/
  3. Verify backups: confirm pg_dump restores cleanly on homelab, spot-check icon files
  4. Delete RDS instance (skip final snapshot — homelab backup is the source of truth, snapshots cost $0.095/GB-month)
  5. Delete S3 everytab-icons bucket
  6. Terminate EC2 instance

DNS Architecture

Unbound runs on the EC2 instance as the system DNS resolver.

Configuration:

  • Recursive resolver mode (no forwarding to any upstream — resolves from root servers)
  • Listening on 127.0.0.1:53
  • Set as system resolver in /etc/resolv.conf
  • Aggressive caching enabled
  • High min-TTL (3600s) — maximizes cache hits for TLD/popular nameservers
  • High cache size (allocate 1-2GB RAM to Unbound)
  • Prefetch enabled (refresh popular entries before expiry)

Why recursive instead of forwarding: Forwarding to Google/Cloudflare would get us rate-limited at 30M+ lookups. Recursive resolution distributes load across thousands of authoritative nameservers. With caching, the actual external query volume is much lower than 30M (most domains share TLD nameservers, many share CDN nameservers).

Transparent to Go: The Go HTTP client uses the OS resolver, which uses Unbound. No custom transport, no SNI issues, no pre-resolved IPs needed. Standard HTTPS connections with normal hostname verification.

Frontend Architecture

File Structure

  • index.html — minimal HTML shell, inline CSS
  • site.js — tab rendering logic, bundle fetching, interaction (separate file for cleanliness, cached after first load)

Requests Per Visit

  1. GET /index.html — HTML + CSS (<10KB)
  2. GET /site.js — JavaScript (cached indefinitely via content hash in filename or cache headers)
  3. GET /tabs/{random}.json — first bundle (~150-300KB, Brotli-compressed to ~100-200KB)

Subsequent scrolls: one additional /tabs/{n}.json per "page" of tabs.

Tab Rendering

  • Rows of tabs fill the viewport, styled to mimic Firefox browser tabs (v1)
  • Each row has a subtle horizontal marquee animation (CSS @keyframes / animation) at slightly varying speeds
  • Tab density adapts to viewport width (responsive)
  • Each tab shows: favicon (rendered via <img src="data:image/png;base64,...">) + truncated title
  • No-icon tabs: just title text, no icon (Firefox behavior)
  • Enough tabs rendered to fill viewport + buffer below fold (so user can scroll immediately without waiting for next fetch)

Interaction

  • Click tab (iframe_ok=true): Opens an iframe overlay showing the actual site
  • Click tab (iframe_ok=false): Opens site in a new tab (with subtle external-link indicator on the tab)
  • Close overlay: X button or click outside dismisses iframe
  • Scroll down: When approaching the bottom, fetch next random bundle and render more rows

Randomization

  • Seed: Date.now() (milliseconds UTC) — every visitor at a different moment sees different tabs
  • PRNG: seeded random number generator (e.g., mulberry32 or xoshiro) for deterministic sequence from seed
  • Generate random bundle indices in range [0, TOTAL_BUNDLES)
  • Track fetched bundle IDs in a Set to avoid loading duplicates on continued scroll
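
The randomization scheme can be sketched as below. The frontend itself would implement this in JavaScript; Go is used here only to stay consistent with the pipeline code. The mulberry32 step is a port of the commonly circulated 32-bit version, the picker type is a hypothetical name, and truncating the millisecond timestamp to 32 bits is an assumption about how the seed is fed to the PRNG.

```go
package main

import (
	"fmt"
	"time"
)

// mulberry32 is a small 32-bit PRNG; the same seed gives the same sequence.
type mulberry32 struct{ state uint32 }

func (m *mulberry32) next() uint32 {
	m.state += 0x6D2B79F5 // wraps modulo 2^32
	t := m.state
	t = (t ^ (t >> 15)) * (t | 1)
	t ^= t + (t^(t>>7))*(t|61)
	return t ^ (t >> 14)
}

// picker hands out bundle indices in [0, total) without repeats.
type picker struct {
	rng   mulberry32
	total uint32
	seen  map[uint32]bool // plays the role of the JS Set
}

func newPicker(seed, total uint32) *picker {
	return &picker{rng: mulberry32{state: seed}, total: total, seen: map[uint32]bool{}}
}

// next returns an unseen index, or ok=false once every bundle was served.
func (p *picker) next() (uint32, bool) {
	if uint32(len(p.seen)) >= p.total {
		return 0, false
	}
	for {
		i := p.rng.next() % p.total
		if !p.seen[i] {
			p.seen[i] = true
			return i, true
		}
	}
}

func main() {
	seed := uint32(time.Now().UnixMilli()) // Date.now() equivalent, truncated
	p := newPicker(seed, 52341)
	for n := 0; n < 3; n++ {
		i, _ := p.next()
		fmt.Printf("GET /tabs/%d.json\n", i)
	}
}
```

Rejection sampling is cheap here because a visitor fetches only a handful of bundles out of tens of thousands, so collisions with the seen set are rare.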

Future Enhancements (v2+)

  • Browser-specific tab styles (Chrome tabs for Chrome users, Safari for Safari, etc.)
  • Mobile-optimized layout
  • "Search for a site" feature
  • Stats page (how many sites, coverage, etc.)

Statistics & Metadata

Each pipeline stage emits a JSON stats file:

stats/
  01_cc_index.json
  02_warc_parse.json
  03_icon_download.json
  04_best_icon.json
  05_bundle_gen.json

After bundle generation, these are merged into a single stats.json uploaded to everytab-site:

{
  "crawl_id": "CC-MAIN-2026-05",
  "generated_at": "2026-05-18T20:15:00Z",
  "pipeline": {
    "cc_index": {
      "started_at": "2026-05-17T08:00:00Z",
      "finished_at": "2026-05-17T08:42:00Z",
      "duration_seconds": 2520,
      "total_domains": 31245678,
      "https": 28901234,
      "http_only": 2344444,
      "duplicates_removed": 1456789
    },
    "warc_parse": {
      "started_at": "2026-05-17T08:45:00Z",
      "finished_at": "2026-05-17T12:15:00Z",
      "duration_seconds": 12600,
      "processed": 31245678,
      "titles_extracted": 29876543,
      "icons_found": 45678901,
      "iframe_restricted": 12345678,
      "parse_failures": 234567
    },
    "icon_download": {
      "started_at": "2026-05-17T12:20:00Z",
      "finished_at": "2026-05-18T18:30:00Z",
      "duration_seconds": 108600,
      "attempted": 45678901,
      "completed": 38901234,
      "failed_dns": 2345678,
      "failed_timeout": 1234567,
      "failed_http_error": 1567890,
      "failed_invalid_image": 890123,
      "failed_too_large": 12345,
      "unique_icons_stored": 34567890,
      "dedup_hits": 4333344
    },
    "best_icon": {
      "started_at": "2026-05-18T18:35:00Z",
      "finished_at": "2026-05-18T18:40:00Z",
      "duration_seconds": 300,
      "hosts_with_icon": 27654321,
      "hosts_without_icon": 3591357
    },
    "bundles": {
      "started_at": "2026-05-18T18:45:00Z",
      "finished_at": "2026-05-18T20:10:00Z",
      "duration_seconds": 5100,
      "total_bundles": 52341,
      "total_hosts_included": 29876543,
      "hosts_with_icon": 27654321,
      "hosts_without_icon": 2222222,
      "excluded_no_title": 1369135,
      "avg_bundle_size_bytes": 245000
    }
  }
}

This is served publicly at /stats.json on the live site — interesting metadata for visitors and useful for monitoring pipeline health across crawls.

Cost Estimate

Scanning Phase (One-Time per Crawl)

| Item | Estimate |
|---|---|
| EC2 c5.xlarge (~24-48hrs) | $8-16 |
| RDS db.t3.medium (~48-72hrs including dev time) | $3-7 |
| S3 everytab-icons storage (~500GB, prorated to days) | $1-3 |
| S3 PUT requests (icon uploads, ~30M at $0.005 per 1,000) | ~$150 |
| Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) |
| Data transfer (icon downloads from internet, inbound) | $0 (inbound free) |
| Data transfer (backup to homelab, outbound) | $5-10 |
| Total | ~$167-186 |

Hosting Phase (Monthly Steady-State)

| Item | Estimate |
|---|---|
| S3 everytab-site storage (~10-15GB of bundles) | $0.35 |
| CloudFront (free tier: 1TB/month transfer, 10M requests/month) | $0 |
| S3 origin requests via CloudFront (heavily cached) | $1-3 |
| Total | ~$2-4/month |

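Note: Bundle storage estimate revised down. With ~50K bundles at ~250KB each = ~12.5GB, well under the previous estimate, since we're targeting viewport-fill (100-150 tabs) rather than 1MB bundles. These figures still need reconciling: at 100-150 entries per bundle, ~30M hosts produce roughly 200K-300K bundles, whereas ~50K bundles would imply about 600 entries each.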

If the site gets significant traffic beyond CloudFront free tier, costs scale with usage — but that's a success problem.

Scaling Strategy

Development Phase (100K domains)

  • Cap CC-Index query to 100K rows
  • Full pipeline runs in minutes
  • Validates end-to-end correctness
  • Frontend development and tab-density tuning

Full Scan (30M domains)

  • Single EC2 instance, high concurrency
  • CC-Index query: <1hr (httpfs) or ~2hrs (download + local query)
  • WARC parsing: 2-6hrs
  • Icon download: 12-48hrs (the long pole)
  • Bundle generation: 1-2hrs
  • Total: ~1-2 days

Fleet Scaling (if single instance is too slow)

  • Spin up N identical EC2 instances running the icon downloader
  • All connect to the same RDS instance
  • Work claiming via FOR UPDATE SKIP LOCKED — no double work, no coordinator
  • Linear throughput scaling: 4 instances ≈ 4x download speed
  • Only the icon download stage benefits from fleet (other stages are fast enough solo)

Key Design Decisions

  1. Static-only hosting — No servers for the live site. Everything pre-built. Minimal attack surface, minimal cost.
  2. Inline icons in bundles — One fetch gives you 100+ tabs to render. No per-icon requests.
  3. Base64 + Brotli — Base64 lets the browser render each icon directly from a data: URI in an <img> tag, with no decoding code. Brotli compression at the CDN layer reduces transfer size by ~25-30% for free.
  4. Unbound as system resolver — Transparent to application code. Standard Go HTTP. No custom networking.
  5. SHA-256 content-addressed icon storage — Natural dedup at S3 layer. Same favicon stored once even if referenced by multiple hosts.
  6. Permissive download, selective bundling — Download ALL favicon formats during scanning. Convert to optimized PNG only during bundle generation. Decouples "capture as much as possible" from "serve the best version."
  7. Partial index for work claiming — Indexes only unscanned rows. Shrinks as work progresses. Minimal write amplification.
  8. Two S3 buckets — Clean separation of concerns. Private working storage vs public site. Safe deletion of temporary data.
  9. Per-millisecond random seed — Every visitor sees a unique arrangement. No shared state, no server needed for randomization.
  10. Viewport-sized bundles — ~100-150 tabs per bundle, tuned to fill a screen. Faster loads, smaller memory footprint than 1MB bundles.
  11. Include no-icon hosts — Any host with a title is included. Firefox-style rendering (title only) for hosts without favicons.
  12. Denormalized best_icon_s3_key in hosts — Avoids joins during bundle generation. Written once during icon selection, read once during bundling.