EveryTab Architecture

System Overview

EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:

  1. Scanning Phase — A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
  2. Hosting Phase — A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.

The scanning phase runs monthly (triggered by new Common Crawl releases), produces a static site, and then its infrastructure is torn down. The hosting phase runs indefinitely at minimal cost.

Common Crawl (S3)
       |
       v
[EC2 + DuckDB] ---> [RDS Postgres] ---> [EC2 + Go programs] ---> S3 (icons/)
       |                    |                     |                     |
       |              (hosts, icons              |                     |
       |               tables)                   v                     v
       |                    |            [Bundle Generator] ---> S3 (tabs/*.json)
       |                    |                                          |
       |                    v                                          v
       |             [Backup to homelab]                    S3 (index.html)
       |                                                          |
       v                                                          v
  [Tear down EC2, RDS]                                     [CloudFront CDN]

AWS Infrastructure

All resources are in us-east-1, the region that hosts Common Crawl's S3 bucket.

| Resource | Purpose | Lifecycle |
|---|---|---|
| EC2 (xlarge, compute-optimized) | Run pipeline stages | Scanning only |
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup then delete) |
| S3 everytab-icons | Raw downloaded favicons | Scanning only (backup then delete) |
| S3 everytab-site | Static site: index.html + tabs/*.json | Permanent |
| CloudFront | CDN for static site | Permanent |
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only |

Steady-State (Hosting Only)

  • S3 everytab-site — stores index.html + ~50K JSON bundle files (~60GB total)
  • CloudFront distribution — serves the site with caching

Scanning Phase (Temporary)

  • EC2 instance — runs all processing (no persistent local storage needed beyond OS)
  • RDS — structured data store during pipeline execution
  • S3 everytab-icons — temporary storage for downloaded favicons

Data Model

hosts table

| Column | Type | Description |
|---|---|---|
| id | SERIAL PRIMARY KEY | Internal ID |
| hostname | TEXT NOT NULL | e.g., example.com |
| protocol | TEXT NOT NULL | https or http (prefer https) |
| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., CC-MAIN-2026-05) |
| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
| warc_record_length | INT NOT NULL | Length of WARC record |
| html_title | TEXT | Extracted from <title> tag |
| iframe_allowed | BOOLEAN | True if site allows framing (no X-Frame-Options/CSP restriction) |
| best_icon_id | INT REFERENCES icons(id) | FK to the chosen icon for bundling |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |

Constraints: UNIQUE(hostname) — one row per domain, prefer https over http.

icons table

| Column | Type | Description |
|---|---|---|
| id | SERIAL PRIMARY KEY | Internal ID |
| host_id | INT REFERENCES hosts(id) | FK to parent host |
| url | TEXT NOT NULL | Full URL to the icon |
| source | TEXT NOT NULL | favicon_ico or link_rel |
| content_type | TEXT | MIME type after download (image/png, image/x-icon, etc.) |
| width | INT | Decoded pixel width |
| height | INT | Decoded pixel height |
| s3_key | TEXT | Key in everytab-icons bucket |
| scan_state | TEXT DEFAULT 'unscanned' | unscanned, in_progress, completed, failed |
| error | TEXT | Error message if failed |

Indexes:

  • idx_icons_scan_state on (scan_state) — for batch claiming work
  • idx_icons_host_id on (host_id) — for best-icon selection

Bundle JSON format (tabs/0001.json)

{
  "entries": [
    {
      "host": "example.com",
      "title": "Example Domain",
      "icon": "iVBORw0KGgo...",
      "icon_w": 32,
      "icon_h": 32,
      "iframe_ok": true
    }
  ]
}

Icons are stored inline as base64-encoded PNG. Each bundle targets ~1MB, yielding approximately 500-700 entries per bundle depending on icon sizes.
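
As a rough illustration of the format above, the bundle could be modeled in Go as follows. The type and function names (Entry, Bundle, newEntry, encodeBundle) are hypothetical, and emitting zero for icon_w/icon_h when a host has no icon is an assumption; the document only specifies "icon": null for that case.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// Entry mirrors one element of "entries" in a bundle file.
// Icon is a pointer so that a host without an icon serializes as "icon": null.
type Entry struct {
	Host     string  `json:"host"`
	Title    string  `json:"title"`
	Icon     *string `json:"icon"`
	IconW    int     `json:"icon_w"`
	IconH    int     `json:"icon_h"`
	IframeOK bool    `json:"iframe_ok"`
}

// Bundle is the top-level object stored as tabs/NNNN.json.
type Bundle struct {
	Entries []Entry `json:"entries"`
}

// newEntry base64-encodes the PNG bytes; an empty slice yields a null icon
// with zero dimensions (assumption: the doc does not define these values).
func newEntry(host, title string, png []byte, w, h int, iframeOK bool) Entry {
	e := Entry{Host: host, Title: title, IframeOK: iframeOK}
	if len(png) > 0 {
		s := base64.StdEncoding.EncodeToString(png)
		e.Icon = &s
		e.IconW, e.IconH = w, h
	}
	return e
}

// encodeBundle returns the compact JSON text for one bundle.
func encodeBundle(b Bundle) (string, error) {
	out, err := json.Marshal(b)
	return string(out), err
}

func main() {
	b := Bundle{Entries: []Entry{
		newEntry("example.com", "Example Domain", []byte{0x89, 'P', 'N', 'G'}, 32, 32, true),
		newEntry("no-icon.example", "No Icon", nil, 0, 0, false),
	}}
	s, err := encodeBundle(b)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Base64 output for real PNG data always begins with "iVBORw0KGgo", which is why the sample icon value above starts that way.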

Pipeline Stages

The pipeline is a series of manually run scripts executed in order. Each stage is idempotent and resumable.

Stage 1: CC-Index Query

Tool: DuckDB with httpfs extension (or local parquet if httpfs takes >1hr)

Input: Common Crawl columnar index (parquet files on CC's S3)

Query logic:

WHERE url_path = '/'
  AND content_mime_type = 'text/html'
  AND fetch_status = 200
  AND url_query IS NULL
  AND url_protocol IN ('http', 'https')
  AND url_port IN (80, 443)

Deduplication: Per hostname, prefer https over http. Result is one row per unique hostname.
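
In practice this rule would most likely be expressed in the DuckDB query itself. Purely to pin down the intended behavior, here is a minimal Go sketch; IndexRow and dedupe are hypothetical names, and the row type carries only the two fields the rule needs.

```go
package main

import "fmt"

// IndexRow is a hypothetical, trimmed-down view of one CC-Index result row.
type IndexRow struct {
	Hostname string
	Protocol string // "http" or "https"
}

// dedupe keeps one row per hostname. The first row seen for a hostname is
// kept unless a later row upgrades it from http to https.
func dedupe(rows []IndexRow) []IndexRow {
	pos := make(map[string]int) // hostname -> index into out
	var out []IndexRow
	for _, r := range rows {
		i, seen := pos[r.Hostname]
		switch {
		case !seen:
			pos[r.Hostname] = len(out)
			out = append(out, r)
		case out[i].Protocol != "https" && r.Protocol == "https":
			out[i] = r
		}
	}
	return out
}

func main() {
	rows := []IndexRow{
		{"a.example", "http"},
		{"b.example", "https"},
		{"a.example", "https"},
		{"b.example", "http"},
	}
	for _, r := range dedupe(rows) {
		fmt.Println(r.Protocol + "://" + r.Hostname)
	}
}
```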

Output: Populates hosts table in RDS (~30M rows for a full crawl).

Stats emitted: Total domains found, https vs http breakdown, duplicates removed.

Stage 2: WARC Parsing

Tool: Custom Go program, highly concurrent

Input: hosts table rows where parsed = FALSE

Process:

  1. Claim a batch of rows (set parsed = TRUE optimistically, or use a cursor)
  2. For each row, make a byte-range GET request to Common Crawl's S3:
    • Range: bytes={offset}-{offset+length-1}
    • Target: s3://commoncrawl/{warc_filename}
  3. Parse the WARC record to extract the HTTP response
  4. Parse HTML (defensively — handle malformed HTML, use a lenient parser):
    • Extract <title> tag content
    • Extract <link rel="icon"> href values (filter to png/gif/ico, sizes 16-64px)
    • Check HTTP response headers for X-Frame-Options and CSP frame-ancestors
  5. Insert a /favicon.ico entry into icons for every host (always attempt this)
  6. Insert any qualifying link rel="icon" entries into icons
  7. Update hosts row with html_title, iframe_allowed, parsed = TRUE
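
Two of the steps above are small enough to sketch directly: building the Range header (step 2) and deriving iframe_allowed from response headers (step 4). The function names are hypothetical, and the policy is an assumption: any X-Frame-Options value, or any frame-ancestors directive other than a bare *, is treated as a restriction.

```go
package main

import (
	"fmt"
	"strings"
)

// rangeHeader builds the inclusive byte range covering one WARC record.
func rangeHeader(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// iframeAllowed reports whether the archived response permits third-party
// framing. xFrameOptions is the X-Frame-Options value ("" if absent);
// cspValues holds every Content-Security-Policy header value.
func iframeAllowed(xFrameOptions string, cspValues []string) bool {
	if strings.TrimSpace(xFrameOptions) != "" {
		return false // DENY, SAMEORIGIN and ALLOW-FROM all block an arbitrary embedder
	}
	for _, csp := range cspValues {
		for _, directive := range strings.Split(csp, ";") {
			fields := strings.Fields(strings.ToLower(directive))
			if len(fields) == 0 || fields[0] != "frame-ancestors" {
				continue
			}
			if len(fields) == 2 && fields[1] == "*" {
				continue // explicit wildcard leaves framing unrestricted
			}
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(rangeHeader(1000, 500))
	fmt.Println(iframeAllowed("", []string{"default-src 'self'; frame-ancestors 'none'"}))
}
```

Erring toward "restricted" costs little here: a false negative only means the tab opens in a new browser tab instead of the overlay.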

Concurrency: High — thousands of goroutines. S3 byte-range requests are the bottleneck; S3 handles 5,500+ GET/s per prefix and WARC files are spread across many prefixes.

Error handling: If HTML is unparseable, mark as parsed with NULL title. If WARC fetch fails, retry once then skip. Log all errors with hostname for investigation.

Stats emitted: Rows processed, titles extracted, icons found (by type), iframe restrictions found, parse failures.

Stage 3: DNS Resolution Setup

Tool: Unbound, installed and configured on EC2

Configuration:

  • Recursive resolver (no forwarding to upstream)
  • Listening on 127.0.0.1:53
  • Aggressive caching enabled
  • High min-TTL (e.g., 3600s) to maximize cache hits across similar domains
  • Configured as system resolver in /etc/resolv.conf

This runs as a background service. No separate "DNS resolution stage" — the Go icon downloader's HTTP requests transparently use Unbound via the OS resolver. Unbound handles recursive resolution and caching.

Why: Downloading 30M+ icons without a local recursive resolver would overwhelm upstream DNS providers and likely get us rate-limited. Unbound resolves from root servers directly, caches aggressively, and handles the load locally.

Stage 4: Icon Download

Tool: Custom Go program, highly concurrent

Input: icons table rows where scan_state = 'unscanned'

Process:

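  1. Claim a batch of rows atomically. Postgres has no UPDATE ... LIMIT, so the batch is selected in a subquery: UPDATE icons SET scan_state = 'in_progress' WHERE id IN (SELECT id FROM icons WHERE scan_state = 'unscanned' LIMIT N FOR UPDATE SKIP LOCKED) RETURNING *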
  2. For each icon URL:
    • Make HTTP(S) GET request (normal Go HTTP client, DNS goes through Unbound)
    • Enforce timeout (5s connect, 10s total)
    • Enforce max download size (512KB — generous for icons)
    • On success: validate it's an image (check magic bytes), decode to get dimensions
    • Upload raw bytes to S3 everytab-icons/{hash} (content-addressed)
    • Update icons row: s3_key, content_type, width, height, scan_state = 'completed'
    • On failure: scan_state = 'failed', error = reason
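
The validation and keying steps above might look like the sketch below. It checks magic bytes only for the three formats Stage 2 filters to (PNG, GIF, ICO). The document says the key is a content hash but does not name the algorithm, so SHA-256 is an assumption, as are the function names.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sniffIcon returns a MIME type based on the leading magic bytes,
// or "" when the payload is not a recognized icon format.
func sniffIcon(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")):
		return "image/png"
	case bytes.HasPrefix(b, []byte("GIF87a")), bytes.HasPrefix(b, []byte("GIF89a")):
		return "image/gif"
	case bytes.HasPrefix(b, []byte{0x00, 0x00, 0x01, 0x00}):
		return "image/x-icon"
	}
	return ""
}

// contentKey derives the object key from the bytes themselves, so identical
// icons downloaded from different hosts map to the same S3 object.
// Assumption: hex-encoded SHA-256.
func contentKey(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	body := []byte("\x89PNG\r\n\x1a\n...rest of file...")
	fmt.Println(sniffIcon(body), contentKey(body))
}
```

An HTML error page served with a 200 status, a common failure for /favicon.ico, fails the magic-byte check and would be recorded as an invalid image.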

Concurrency: Maximize throughput — goroutine pool with configurable size (start at 1000, tune based on memory/bandwidth). Use semaphore pattern for backpressure.
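
A minimal sketch of the semaphore pattern, assuming a buffered channel as the semaphore; runBounded and its parameters are hypothetical names. The producer blocks once limit downloads are in flight, which is the backpressure mentioned above.

```go
package main

import (
	"fmt"
	"sync"
)

// runBounded calls work for every item, with at most limit calls running
// concurrently, and returns once all of them have finished.
func runBounded(items []string, limit int, work func(string)) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, item := range items {
		sem <- struct{}{} // blocks while limit workers are busy
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this worker exits
			work(item)
		}(item)
	}
	wg.Wait()
}

func main() {
	urls := []string{
		"https://a.example/favicon.ico",
		"https://b.example/favicon.ico",
		"https://c.example/favicon.ico",
	}
	runBounded(urls, 2, func(u string) {
		fmt.Println("fetching", u) // a real worker would download and validate here
	})
}
```

The limit corresponds to the configurable pool size (1000 as a starting point in the text above).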

Fast failure: DNS errors, refused connections, and timeouts are all terminal. The icon is marked failed on the first error with no retry (if it's down, it's down). This keeps the long tail short.

Scaling to fleet: If a single instance is insufficient:

  • Multiple EC2 instances run the same binary
  • Each claims work via the scan_state UPDATE (Postgres row-level locking prevents double-work)
  • No coordination needed beyond the shared database
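
One way to make the claim step safe across several workers is FOR UPDATE SKIP LOCKED (Postgres 9.5+), sketched here with the standard database/sql package. IconJob and claimBatch are hypothetical names, the $1 placeholder assumes a Postgres driver, and no driver is imported, so main only prints the statement.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// claimQuery marks up to $1 unscanned icons as in_progress and returns them.
// SKIP LOCKED makes concurrent workers pass over rows that another worker is
// claiming instead of waiting on them, so no two workers get the same row.
const claimQuery = `
UPDATE icons SET scan_state = 'in_progress'
WHERE id IN (
    SELECT id FROM icons
    WHERE scan_state = 'unscanned'
    LIMIT $1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, url`

// IconJob is a hypothetical work item holding what the downloader needs.
type IconJob struct {
	ID  int64
	URL string
}

// claimBatch runs the claim statement and collects the returned rows.
func claimBatch(ctx context.Context, db *sql.DB, n int) ([]IconJob, error) {
	rows, err := db.QueryContext(ctx, claimQuery, n)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var jobs []IconJob
	for rows.Next() {
		var j IconJob
		if err := rows.Scan(&j.ID, &j.URL); err != nil {
			return nil, err
		}
		jobs = append(jobs, j)
	}
	return jobs, rows.Err()
}

func main() {
	// Without a linked driver there is no database to talk to, so just show
	// the statement each worker would execute.
	fmt.Println(claimQuery)
}
```

Rows left in_progress by a crashed worker are not reclaimed by this statement; resetting them to unscanned before a restart would be a separate step.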

Stats emitted: Icons attempted, completed, failed (by error type: DNS, timeout, HTTP error, invalid image, too large), download rate (icons/sec), bytes downloaded.

Stage 5: Best Icon Selection

Tool: SQL query or small script

Process: For each host, select the best icon from its completed icons:

  1. Filter to standard sizes: 16x16, 32x32, 48x48, 64x64
  2. Among those, pick the largest dimensions (prefer 64 > 48 > 32 > 16)
  3. If no standard sizes found, pick the largest icon with dimensions <= 64px on both axes
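  4. If no icon qualifies, the host gets a NULL best_icon_id (the frontend then renders the tab without an icon)

UPDATE hosts h SET best_icon_id = (
  SELECT i.id FROM icons i
  WHERE i.host_id = h.id
    AND i.scan_state = 'completed'
    AND i.width <= 64 AND i.height <= 64   -- also drops rows with NULL dimensions
  ORDER BY
    (i.width = i.height AND i.width IN (16, 32, 48, 64)) DESC,  -- standard square sizes first
    i.width * i.height DESC                                      -- then the largest area
  LIMIT 1
);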
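
The four selection rules can also be written procedurally, which makes the tie-breaking explicit and easy to unit test. This is a sketch under two assumptions: "standard size" means a square icon of 16, 32, 48, or 64 pixels, and "largest" among non-standard icons means largest pixel area. Icon, isStandard, better, and bestIcon are hypothetical names.

```go
package main

import "fmt"

// Icon holds the fields of a completed icons row that selection depends on.
type Icon struct {
	ID, Width, Height int
}

// isStandard reports whether the icon is square at 16, 32, 48, or 64 pixels.
func isStandard(ic Icon) bool {
	if ic.Width != ic.Height {
		return false
	}
	switch ic.Width {
	case 16, 32, 48, 64:
		return true
	}
	return false
}

// better reports whether a should be preferred over b: standard sizes beat
// non-standard ones, and within a class the larger pixel area wins.
func better(a, b Icon) bool {
	if isStandard(a) != isStandard(b) {
		return isStandard(a)
	}
	return a.Width*a.Height > b.Width*b.Height
}

// bestIcon returns the preferred icon, or false when none is within 64x64.
func bestIcon(icons []Icon) (Icon, bool) {
	var best Icon
	found := false
	for _, ic := range icons {
		if ic.Width <= 0 || ic.Height <= 0 || ic.Width > 64 || ic.Height > 64 {
			continue // undecoded or oversized
		}
		if !found || better(ic, best) {
			best, found = ic, true
		}
	}
	return best, found
}

func main() {
	icons := []Icon{{1, 16, 16}, {2, 48, 48}, {3, 60, 60}, {4, 128, 128}}
	fmt.Println(bestIcon(icons))
}
```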

Stats emitted: Hosts with icons, hosts without icons, icon size distribution.

Stage 6: Bundle Generation

Tool: Custom Go program

Input: hosts table (joined with their best icon from S3)

Process:

  1. Query all hosts where best_icon_id IS NOT NULL (or include no-icon hosts with a default flag)
  2. Randomize the full result set (ORDER BY random() or shuffle in memory)
  3. For each host:
    • Download its best icon from S3 everytab-icons
    • Decode the icon (ICO/GIF/PNG/etc.)
    • For ICO files: extract the largest embedded image at a standard size <= 64x64
    • Re-encode as PNG (optimized compression)
    • Base64-encode the PNG bytes
  4. Chunk into groups of N entries (~500-700, tuned so each JSON is ~1MB)
  5. Write each chunk as tabs/{n}.json to S3 everytab-site
  6. Record total bundle count
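
Step 4 chunks by entry count. A size-based variant, sketched below, packs entries greedily against a byte budget and hits the ~1MB target more directly when icon sizes vary. This is an alternative for illustration, not the method described above. packBySize and bundleKey are hypothetical names, and the four-digit zero padding follows the tabs/0000.json naming.

```go
package main

import "fmt"

// packBySize groups consecutive entries into bundles. sizes[i] is the encoded
// size in bytes of entry i; a new bundle starts when adding the next entry
// would push the current one past budget. Each inner slice lists entry indexes.
func packBySize(sizes []int, budget int) [][]int {
	var bundles [][]int
	var cur []int
	used := 0
	for i, s := range sizes {
		if len(cur) > 0 && used+s > budget {
			bundles = append(bundles, cur)
			cur, used = nil, 0
		}
		cur = append(cur, i)
		used += s
	}
	if len(cur) > 0 {
		bundles = append(bundles, cur)
	}
	return bundles
}

// bundleKey formats the S3 key for bundle n, zero-padded to four digits.
// Indexes of 10000 and above simply use more digits.
func bundleKey(n int) string {
	return fmt.Sprintf("tabs/%04d.json", n)
}

func main() {
	sizes := []int{400, 400, 400, 900, 100} // toy sizes in bytes
	for n, b := range packBySize(sizes, 1000) {
		fmt.Println(bundleKey(n), b)
	}
}
```

With roughly 50K bundles, keys past tabs/9999.json grow to five digits, so the frontend has to build keys with the same padding rule.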

Output:

  • tabs/0000.json through tabs/{M-1}.json in S3 (zero-based, matching the frontend's [0, TOTAL_BUNDLES) index range)
  • Total bundle count M (used in frontend build)

Stats emitted: Total bundles created, total hosts included, total hosts excluded (no icon), average bundle size, total S3 storage used.

Stage 7: Frontend Build

Tool: Script/template that produces index.html

Process:

  1. Inject TOTAL_BUNDLES constant into the JS (baked at build time)
  2. Minify if desired
  3. Upload index.html to S3 everytab-site root

Stage 8: CloudFront Invalidation

Invalidate /* on the CloudFront distribution so the new site is live.

Stage 9: Backup & Teardown

Process:

  1. Dump RDS database to local machine (homelab) — pg_dump over SSH tunnel or direct
  2. Sync S3 everytab-icons to homelab storage — aws s3 sync
  3. Confirm backups are complete
  4. Delete RDS instance
  5. Delete S3 everytab-icons bucket
  6. Terminate EC2 instance

Frontend Architecture

Single-File Design

One index.html containing inline CSS and JS. No external dependencies, no framework. Two HTTP requests per initial page load:

  1. GET /index.html (HTML + CSS + JS, likely <50KB)
  2. GET /tabs/{random}.json (~1MB, one bundle of ~500-700 tabs)

Tab Rendering

  • Tabs fill the viewport in rows, styled to mimic Firefox browser tabs (v1)
  • Each row has a slight horizontal marquee animation (CSS) at varying speeds
  • Tab density adapts to viewport width (responsive)
  • Each tab shows: favicon (or blank for no-icon) + truncated title

Interaction

  • Click tab (iframe_ok=true): Opens an iframe overlay showing the actual site
  • Click tab (iframe_ok=false): Opens site in a new tab (with external link indicator)
  • Close: X button or click-away dismisses the iframe/overlay
  • Scroll down: Triggers fetch of additional random bundles (infinite scroll)

Randomization

  • Seed: current UTC date (so everyone on the same day sees the same "shuffle", but it changes daily)
  • Generate random bundle index in range [0, TOTAL_BUNDLES)
  • Track fetched bundle IDs in a Set to avoid duplicates on scroll
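
The real frontend would implement this in JavaScript. To keep one language across the examples, the logic is sketched here in Go. It assumes the date string is hashed with FNV-1a to form the seed and that the bundle sequence is a seeded permutation, which by construction never repeats an index; the bullets above instead draw indexes one at a time and track them in a Set. dateSeed and bundleOrder are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// dateSeed turns a UTC date such as "2026-01-15" into a PRNG seed.
func dateSeed(date string) int64 {
	h := fnv.New64a()
	h.Write([]byte(date))
	return int64(h.Sum64())
}

// bundleOrder returns every bundle index in [0, total) exactly once, in an
// order that is identical for all callers passing the same date.
func bundleOrder(date string, total int) []int {
	r := rand.New(rand.NewSource(dateSeed(date)))
	return r.Perm(total)
}

func main() {
	order := bundleOrder("2026-01-15", 8)
	fmt.Println("first bundle to fetch:", order[0])
	fmt.Println("fetch order on scroll:", order)
}
```

A JavaScript port needs its own seedable generator, since Math.random cannot be seeded.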

No-Icon Hosts

Hosts without a favicon are included in bundles with "icon": null. The frontend renders these as title text with no icon, matching Firefox's behavior for tabs without favicons.

Cost Estimate

Scanning Phase (One-Time per Crawl)

| Item | Estimate |
|---|---|
| EC2 c5.xlarge (~24-48hrs) | $8-16 |
| RDS db.t3.medium (~48hrs) | $3-5 |
| S3 icons storage (temporary, ~500GB) | $12 (prorated to days) |
| S3 GET requests (30M WARC reads) | $12 |
| Data transfer (icon downloads, ~500GB inbound) | $0 (inbound is free) |
| Total | ~$35-45 |

Hosting Phase (Monthly Steady-State)

| Item | Estimate |
|---|---|
| S3 storage (~60GB bundles) | $1.40 |
| CloudFront (free tier: 1TB/month, 10M requests) | $0* |
| S3 requests (via CloudFront origin pulls, cached) | ~$1-5 |
| Total | ~$3-10/month |

*CloudFront free tier covers moderate traffic. Costs increase if the site goes viral, but that's a good problem to have.

Scaling Strategy

Development (100K domains)

  • Single EC2 instance
  • All stages complete in minutes-to-hours
  • Good for validating the full pipeline end-to-end

Full Scan (30M domains)

  • Single EC2 instance, high concurrency
  • CC-Index query: <1hr
  • WARC parsing: 2-6hrs (limited by S3 request rate)
  • Icon download: 12-48hrs (limited by network + remote server response times)
  • Bundle generation: 1-2hrs

Fleet Scaling (if needed)

  • Spin up N identical EC2 instances running the icon downloader
  • All share the same RDS instance
  • Work claiming via Postgres atomic UPDATEs (no coordinator needed)
  • Roughly linear scaling: 4 instances = ~4x throughput, until the shared RDS instance becomes the bottleneck

Key Design Decisions

  1. Static-only hosting — No servers running for the live site. Entire frontend is pre-built.
  2. Inline icons in bundles — No per-icon requests. One bundle fetch gives you ~600 tabs to render.
  3. Unbound as system resolver — Transparent to application code. Go HTTP client works normally; DNS just happens to resolve locally.
  4. Content-addressed icon storage — S3 key is the content hash. Natural dedup at storage layer during scanning (but icons are duplicated across bundles for simplicity).
  5. Resumable pipeline — Each stage uses database state (parsed, scan_state) to track progress. Crash and restart without re-doing completed work.
  6. PNG as universal icon format — All icons are converted to PNG for bundles regardless of source format. PNG is lossless, compact for small raster images, and universally supported in browsers via data URIs.
  7. Date-seeded randomization — Everyone visiting on the same day sees the same tab arrangement, creating a shared experience. Changes daily for freshness.