Joe Lothan cf6d819f1f initial ARCHITECTURE.md document

2026-05-17 12:19:06 -04:00

EveryTab Architecture

System Overview

EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:

Scanning Phase — A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
Hosting Phase — A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.

The scanning phase runs monthly (triggered by new Common Crawl releases), produces a static site, and then its infrastructure is torn down. The hosting phase runs indefinitely at minimal cost.

Common Crawl (S3)
       |
       v
[EC2 + DuckDB] ---> [RDS Postgres] ---> [EC2 + Go programs] ---> S3 (icons/)
       |                    |                     |                     |
       |              (hosts, icons              |                     |
       |               tables)                   v                     v
       |                    |            [Bundle Generator] ---> S3 (tabs/*.json)
       |                    |                                          |
       |                    v                                          v
       |             [Backup to homelab]                    S3 (index.html)
       |                                                          |
       v                                                          v
  [Tear down EC2, RDS]                                     [CloudFront CDN]

AWS Infrastructure

All resources in us-east-1.

| Resource | Purpose | Lifecycle |
| --- | --- | --- |
| EC2 (xlarge, compute-optimized) | Run pipeline stages | Scanning only |
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup then delete) |
| S3 `everytab-icons` | Raw downloaded favicons | Scanning only (backup then delete) |
| S3 `everytab-site` | Static site: index.html + tabs/*.json | Permanent |
| CloudFront | CDN for static site | Permanent |
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only |

Steady-State (Hosting Only)

S3 everytab-site — stores index.html + ~50K JSON bundle files (~60GB total)
CloudFront distribution — serves the site with caching

Scanning Phase (Temporary)

EC2 instance — runs all processing (no persistent local storage needed beyond OS)
RDS — structured data store during pipeline execution
S3 everytab-icons — temporary storage for downloaded favicons

Data Model

`hosts` table

| Column | Type | Description |
| --- | --- | --- |
| id | SERIAL PRIMARY KEY | Internal ID |
| hostname | TEXT NOT NULL | e.g., `example.com` |
| protocol | TEXT NOT NULL | `https` or `http` (prefer https) |
| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., `CC-MAIN-2026-05`) |
| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
| warc_record_length | INT NOT NULL | Length of WARC record in bytes |
| html_title | TEXT | Extracted from `<title>` tag |
| iframe_allowed | BOOLEAN | True if site allows framing (no X-Frame-Options/CSP restriction) |
| best_icon_id | INT REFERENCES icons(id) | FK to the chosen icon for bundling |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |

Constraints: UNIQUE(hostname) — one row per domain, prefer https over http.

`icons` table

| Column | Type | Description |
| --- | --- | --- |
| id | SERIAL PRIMARY KEY | Internal ID |
| host_id | INT REFERENCES hosts(id) | FK to parent host |
| url | TEXT NOT NULL | Full URL to the icon |
| source | TEXT NOT NULL | `favicon_ico` or `link_rel` |
| content_type | TEXT | MIME type after download (image/png, image/x-icon, etc.) |
| width | INT | Decoded pixel width |
| height | INT | Decoded pixel height |
| s3_key | TEXT | Key in everytab-icons bucket |
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
| error | TEXT | Error message if failed |

Indexes:

idx_icons_scan_state on (scan_state) — for batch claiming work
idx_icons_host_id on (host_id) — for best-icon selection

Bundle JSON format (`tabs/0001.json`)

{
  "entries": [
    {
      "host": "example.com",
      "title": "Example Domain",
      "icon": "iVBORw0KGgo...",
      "icon_w": 32,
      "icon_h": 32,
      "iframe_ok": true
    }
  ]
}

Icons are stored inline as base64-encoded PNG. Each bundle targets ~1MB, yielding approximately 500-700 entries per bundle depending on icon sizes.

Pipeline Stages

The pipeline is a series of manually-run scripts executed in order. Each stage is idempotent and resumable.

Stage 1: CC-Index Query

Tool: DuckDB with httpfs extension (or local parquet if httpfs takes >1hr)

Input: Common Crawl columnar index (parquet files on CC's S3)

Query logic:

WHERE url_path = '/'
  AND content_mime_type = 'text/html'
  AND fetch_status = 200
  AND url_query IS NULL
  AND url_protocol IN ('http', 'https')
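  -- url_port is NULL when the URL carries no explicit port (assumed from the CC index schema)
  AND (url_port IS NULL OR url_port IN (80, 443))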

Deduplication: Per hostname, prefer https over http. Result is one row per unique hostname.
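
The deduplication rule can be sketched in Go as follows. The document runs this step in DuckDB, so this only illustrates the rule; `HostRow` and `dedupeHosts` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// HostRow is a pared-down index row: just the fields the rule needs.
type HostRow struct {
	Hostname string
	Protocol string // "http" or "https"
}

// dedupeHosts keeps one row per hostname, preferring https over http.
// The result is sorted by hostname so the output is deterministic.
func dedupeHosts(rows []HostRow) []HostRow {
	best := make(map[string]HostRow, len(rows))
	for _, r := range rows {
		cur, seen := best[r.Hostname]
		if !seen || (cur.Protocol == "http" && r.Protocol == "https") {
			best[r.Hostname] = r
		}
	}
	out := make([]HostRow, 0, len(best))
	for _, r := range best {
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Hostname < out[j].Hostname })
	return out
}

func main() {
	rows := []HostRow{
		{"example.com", "http"},
		{"example.com", "https"},
		{"plain.example", "http"},
	}
	for _, r := range dedupeHosts(rows) {
		fmt.Printf("%s://%s\n", r.Protocol, r.Hostname)
	}
}
```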

Output: Populates hosts table in RDS (~30M rows for a full crawl).

Stats emitted: Total domains found, https vs http breakdown, duplicates removed.

Stage 2: WARC Parsing

Tool: Custom Go program, highly concurrent

Input: hosts table rows where parsed = FALSE

Process:

Claim a batch of rows (set parsed = TRUE optimistically, or use a cursor)
For each row, make a byte-range GET request to Common Crawl's S3:
- Range: bytes={offset}-{offset+length-1}
- Target: s3://commoncrawl/{warc_filename}
Parse the WARC record to extract the HTTP response
Parse HTML (defensively — handle malformed HTML, use a lenient parser):
- Extract <title> tag content
- Extract <link rel="icon"> href values (filter to png/gif/ico, sizes 16-64px)
- Check HTTP response headers for X-Frame-Options and CSP frame-ancestors
Insert a /favicon.ico entry into icons for every host (always attempt this)
Insert any qualifying link rel="icon" entries into icons
Update hosts row with html_title, iframe_allowed, parsed = TRUE

Concurrency: High — thousands of goroutines. S3 byte-range requests are the bottleneck; S3 handles 5,500+ GET/s per prefix and WARC files are spread across many prefixes.

Error handling: If HTML is unparseable, mark as parsed with NULL title. If WARC fetch fails, retry once then skip. Log all errors with hostname for investigation.

Stats emitted: Rows processed, titles extracted, icons found (by type), iframe restrictions found, parse failures.

Stage 3: DNS Resolution Setup

Tool: Unbound, installed and configured on EC2

Configuration:

Recursive resolver (no forwarding to upstream)
Listening on 127.0.0.1:53
Aggressive caching enabled
High min-TTL (e.g., 3600s) to maximize cache hits across similar domains
Configured as system resolver in /etc/resolv.conf

This runs as a background service. No separate "DNS resolution stage" — the Go icon downloader's HTTP requests transparently use Unbound via the OS resolver. Unbound handles recursive resolution and caching.

Why: Downloading 30M+ icons without a local recursive resolver would overwhelm upstream DNS providers and likely get us rate-limited. Unbound resolves from root servers directly, caches aggressively, and handles the load locally.

Stage 4: Icon Download

Tool: Custom Go program, highly concurrent

Input: icons table rows where scan_state = 'unscanned'

Process:

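Claim a batch of rows atomically. Postgres UPDATE has no LIMIT clause, so select the batch in a subquery: UPDATE icons SET scan_state = 'in_progress' WHERE id IN (SELECT id FROM icons WHERE scan_state = 'unscanned' LIMIT N FOR UPDATE SKIP LOCKED) RETURNING *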
For each icon URL:
- Make HTTP(S) GET request (normal Go HTTP client, DNS goes through Unbound)
- Enforce timeout (5s connect, 10s total)
- Enforce max download size (512KB — generous for icons)
- On success: validate it's an image (check magic bytes), decode to get dimensions
- Upload raw bytes to S3 everytab-icons/{hash} (content-addressed)
- Update icons row: s3_key, content_type, width, height, scan_state = 'completed'
- On failure: scan_state = 'failed', error = reason

Concurrency: Maximize throughput — goroutine pool with configurable size (start at 1000, tune based on memory/bandwidth). Use semaphore pattern for backpressure.

Fast failure: DNS errors, refused connections, and timeouts are final on the first attempt. Icons are never retried (if it's down, it's down), which keeps the long tail short.

Scaling to fleet: If a single instance is insufficient:

Multiple EC2 instances run the same binary
Each claims work via the scan_state UPDATE (Postgres row-level locking prevents double-work)
No coordination needed beyond the shared database

Stats emitted: Icons attempted, completed, failed (by error type: DNS, timeout, HTTP error, invalid image, too large), download rate (icons/sec), bytes downloaded.

Stage 5: Best Icon Selection

Tool: SQL query or small script

Process: For each host, select the best icon from its completed icons:

Filter to standard sizes: 16x16, 32x32, 48x48, 64x64
Among those, pick the largest dimensions (prefer 64 > 48 > 32 > 16)
If no standard sizes found, pick the largest icon with dimensions <= 64px on both axes
If no icons at all, host gets a NULL best_icon_id (will use default in frontend)

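UPDATE hosts h SET best_icon_id = (
  SELECT i.id FROM icons i
  WHERE i.host_id = h.id
    AND i.scan_state = 'completed'
    AND i.width <= 64 AND i.height <= 64  -- never pick an icon larger than 64px
  ORDER BY
    (i.width = i.height AND i.width IN (16, 32, 48, 64)) DESC,  -- standard square sizes first
    i.width * i.height DESC,                                    -- then the largest area
    i.id                                                        -- deterministic tie-break
  LIMIT 1
);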
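
For comparison, the same selection rule written in Go. This is a sketch of the rules listed above rather than the project's code; `Icon` and `pickBestIcon` are hypothetical names, and comparing non-standard icons by pixel area is one reading of "largest".

```go
package main

import "fmt"

// Icon holds the fields of a completed icons row that the rule needs.
type Icon struct {
	ID, Width, Height int
}

// isStandard reports whether the icon is square at 16, 32, 48, or 64 pixels.
func isStandard(ic Icon) bool {
	if ic.Width != ic.Height {
		return false
	}
	switch ic.Width {
	case 16, 32, 48, 64:
		return true
	}
	return false
}

// better reports whether a should be preferred over b:
// standard sizes beat non-standard ones, then larger area wins.
func better(a, b Icon) bool {
	if isStandard(a) != isStandard(b) {
		return isStandard(a)
	}
	return a.Width*a.Height > b.Width*b.Height
}

// pickBestIcon returns the chosen icon, or false when nothing fits within
// 64x64 (the host then keeps a NULL best_icon_id).
func pickBestIcon(icons []Icon) (Icon, bool) {
	var best Icon
	found := false
	for _, ic := range icons {
		if ic.Width <= 0 || ic.Height <= 0 || ic.Width > 64 || ic.Height > 64 {
			continue
		}
		if !found || better(ic, best) {
			best, found = ic, true
		}
	}
	return best, found
}

func main() {
	icons := []Icon{{1, 16, 16}, {2, 60, 60}, {3, 48, 48}, {4, 128, 128}}
	best, ok := pickBestIcon(icons)
	fmt.Println(best.ID, ok)
}
```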

Stats emitted: Hosts with icons, hosts without icons, icon size distribution.

Stage 6: Bundle Generation

Tool: Custom Go program

Input: hosts table (joined with their best icon from S3)

Process:

Query all hosts. Hosts with a NULL best_icon_id are kept and emitted with "icon": null (see No-Icon Hosts)
Randomize the full result set (ORDER BY random() or shuffle in memory)
For each host:
- Download its best icon from S3 everytab-icons
- Decode the icon (ICO/GIF/PNG/etc.)
- For ICO files: extract the largest embedded image at a standard size <= 64x64
- Re-encode as PNG (optimized compression)
- Base64-encode the PNG bytes
Chunk into groups of N entries (~500-700, tuned so each JSON is ~1MB)
Write each chunk as tabs/{n}.json to S3 everytab-site
Record total bundle count

Output:

tabs/0000.json through tabs/{M}.json in S3
Total bundle count M (used in frontend build)

Stats emitted: Total bundles created, total hosts included, hosts bundled without an icon, average bundle size, total S3 storage used.

Stage 7: Frontend Build

Tool: Script/template that produces index.html

Process:

Inject TOTAL_BUNDLES constant into the JS (baked at build time)
Minify if desired
Upload index.html to S3 everytab-site root

Stage 8: CloudFront Invalidation

Invalidate /* on the CloudFront distribution so the new site is live.

Stage 9: Backup & Teardown

Process:

Dump RDS database to local machine (homelab) — pg_dump over SSH tunnel or direct
Sync S3 everytab-icons to homelab storage — aws s3 sync
Confirm backups are complete
Delete RDS instance
Delete S3 everytab-icons bucket
Terminate EC2 instance

Frontend Architecture

Single-File Design

One index.html containing inline CSS and JS. No external dependencies, no framework. Two HTTP requests per initial page load:

GET /index.html (HTML + CSS + JS, likely <50KB)
GET /tabs/{random}.json (~1MB, one bundle of ~500-700 tabs)

Tab Rendering

Tabs fill the viewport in rows, styled to mimic Firefox browser tabs (v1)
Each row has a slight horizontal marquee animation (CSS) at varying speeds
Tab density adapts to viewport width (responsive)
Each tab shows: favicon (or blank for no-icon) + truncated title

Interaction

Click tab (iframe_ok=true): Opens an iframe overlay showing the actual site
Click tab (iframe_ok=false): Opens site in a new tab (with external link indicator)
Close: X button or click-away dismisses the iframe/overlay
Scroll down: Triggers fetch of additional random bundles (infinite scroll)

Randomization

Seed: current UTC date (so everyone on the same day sees the same "shuffle", but it changes daily)
Generate random bundle index in range [0, TOTAL_BUNDLES)
Track fetched bundle IDs in a Set to avoid duplicates on scroll

No-Icon Hosts

Hosts without a favicon are included in bundles with "icon": null. The frontend renders these as title text with no icon, matching how Firefox displays tabs that have no favicon.

Cost Estimate

Scanning Phase (One-Time per Crawl)

| Item | Estimate |
| --- | --- |
| EC2 c5.xlarge (~24-48hrs) | $8-16 |
| RDS db.t3.medium (~48hrs) | $3-5 |
| S3 icons storage (temporary, ~500GB) | $12 (prorated to days) |
| S3 GET requests (30M WARC reads) | $12 |
| Data transfer (icon downloads, ~500GB inbound) | $0 (inbound is free) |
| Total | ~$35-45 |

Hosting Phase (Monthly Steady-State)

| Item | Estimate |
| --- | --- |
| S3 storage (~60GB bundles) | $1.40 |
| CloudFront (free tier: 1TB/month, 10M requests) | $0* |
| S3 requests (via CloudFront origin pulls, cached) | ~$1-5 |
| Total | ~$3-10/month |

*CloudFront free tier covers moderate traffic. Costs increase if the site goes viral, but that's a good problem to have.

Scaling Strategy

Development (100K domains)

Single EC2 instance
All stages complete in minutes-to-hours
Good for validating the full pipeline end-to-end

Full Scan (30M domains)

Single EC2 instance, high concurrency
CC-Index query: <1hr
WARC parsing: 2-6hrs (limited by S3 request rate)
Icon download: 12-48hrs (limited by network + remote server response times)
Bundle generation: 1-2hrs
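
The time estimates imply sustained request rates, which are worth keeping in mind when tuning concurrency. The arithmetic below uses 30M items for both stages. For icons that is a floor, since every host gets a /favicon.ico row plus any link rel entries.

```go
package main

import "fmt"

// requiredRate returns the items per second needed to finish within hours.
func requiredRate(items, hours float64) float64 {
	return items / (hours * 3600)
}

func main() {
	const items = 30e6 // ~30M hosts, from the Stage 1 output estimate

	fmt.Println("WARC parsing:")
	for _, h := range []float64{2, 6} {
		fmt.Printf("  %2.0fh -> %.0f requests/sec\n", h, requiredRate(items, h))
	}

	fmt.Println("Icon download:")
	for _, h := range []float64{12, 48} {
		fmt.Printf("  %2.0fh -> %.0f icons/sec\n", h, requiredRate(items, h))
	}
}
```

Finishing WARC parsing in 2 hours needs about 4,167 requests/sec, under the 5,500 GET/s per-prefix figure quoted in Stage 2. Downloading icons in 12 to 48 hours needs roughly 694 down to 174 icons/sec.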

Fleet Scaling (if needed)

Spin up N identical EC2 instances running the icon downloader
All share the same RDS instance
Work claiming via Postgres atomic UPDATEs (no coordinator needed)
Linear scaling: 4 instances = ~4x throughput

Key Design Decisions

Static-only hosting — No servers running for the live site. Entire frontend is pre-built.
Inline icons in bundles — No per-icon requests. One bundle fetch gives you ~600 tabs to render.
Unbound as system resolver — Transparent to application code. Go HTTP client works normally; DNS just happens to resolve locally.
Content-addressed icon storage — S3 key is the content hash. Natural dedup at storage layer during scanning (but icons are duplicated across bundles for simplicity).
Resumable pipeline — Each stage uses database state (parsed, scan_state) to track progress. Crash and restart without re-doing completed work.
PNG as universal icon format: All icons are converted to PNG for bundles regardless of source format. PNG is lossless, compact for small raster images, and universally supported in browsers via data URIs.
Date-seeded randomization — Everyone visiting on the same day sees the same tab arrangement, creating a shared experience. Changes daily for freshness.
