# EveryTab Architecture

## System Overview

EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:

1. **Scanning Phase** — A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
2. **Hosting Phase** — A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.

The scanning phase runs monthly (triggered by each new Common Crawl release) and produces a static site. Its infrastructure is then torn down once the data has been backed up to the homelab. The hosting phase runs indefinitely at minimal cost.

## Workflow Diagram

```mermaid
flowchart TD
    subgraph EC2["Scanning Phase (EC2 instance)"]
        A["Stage 1: Query CC-Index via DuckDB"]
        B["Stage 2: Parse WARCs - Go"]
        C["Stage 3: Download Icons - Go"]
        D["Stage 4: Select Best Icons"]
        E["Stage 5: Generate Bundles - Go"]
        F["Stage 6: Deploy Frontend"]
        UB["Unbound - Local recursive resolver"]
        DISK["Local disk - Sharded icon archive"]

        A --> B --> C --> D --> E --> F
        UB -.-> C
        C --> DISK
        DISK --> E
    end

    subgraph ExtData["External Data"]
        CC["Common Crawl S3 - Parquet Index + WARCs"]
    end

    subgraph AWS["AWS Services"]
        RDS[("RDS Postgres - hosts + icons tables")]
        S3S["S3: everytab-site - tabs/*.json + index.html"]
        CF["CloudFront CDN"]
    end

    subgraph Post["Post-Scan"]
        BAK["Backup to Homelab - RDS dump + icons rsync"]
        TEAR["Teardown - Delete RDS, EC2"]
    end

    CC --> A
    CC --> B
    A --> RDS
    B --> RDS
    C --> RDS
    D --> RDS
    E --> S3S
    F --> S3S
    S3S --> CF

    F --> BAK
    BAK --> TEAR
```

**Key point:** DuckDB, Go programs, and Unbound all run on the same EC2 instance. The pipeline is sequential — one stage completes before the next begins.

## AWS Infrastructure

All resources in **us-east-1**.

| Resource | Purpose | Lifecycle |
|----------|---------|-----------|
| EC2 (c5.2xlarge) + 1TB EBS | Run all pipeline stages, store icon archive | Scanning only |
| RDS Postgres (db.m5.large) | Store hosts/icons metadata | Scanning only (backup to homelab, then delete) |
| S3 `everytab-site` | Static site: index.html, site.js, tabs/*.json | Permanent |
| CloudFront | CDN for static site (Brotli compression enabled) | Permanent |
| S3 `everytab-logs` | CloudFront access logs | Permanent |
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only (runs on EC2) |

### Icon Storage

Icons are stored on local disk during scanning, not S3. The EBS volume holds the full icon archive in a sharded directory structure (`ab/cd/ef/{sha256}`). This avoids ~$175 in S3 PUT costs at 30M scale. After scanning completes, icons are backed up to the homelab via rsync.

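As a rough check on the PUT-cost figure, the arithmetic can be sketched as follows. The price of $0.005 per 1,000 PUT requests is an assumption based on S3 Standard pricing in us-east-1 and should be verified against current pricing. The object count of about 35M is also an assumption (one PUT per stored icon file, which can exceed the host count because a host may have several icons).

```python
# Assumed S3 Standard PUT price in us-east-1: $0.005 per 1,000 requests.
PUT_PRICE_PER_1000 = 0.005


def s3_put_cost(object_count: int) -> float:
    """Estimated USD cost of uploading object_count objects, one PUT each."""
    return object_count / 1000 * PUT_PRICE_PER_1000


# Hypothetical count: ~35M icon files for a ~30M-host scan.
print(f"${s3_put_cost(35_000_000):.2f}")
```

Under those assumptions the result lands at roughly the $175 quoted above. Storing icons on EBS replaces this per-request charge with the flat volume cost.
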
### Steady-State (Hosting Only)
- S3 `everytab-site` — index.html + site.js + ~250K JSON bundles
- CloudFront distribution — Brotli-compressed delivery, caching

## Data Model

### `hosts` table

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PRIMARY KEY | Internal ID |
| hostname | TEXT NOT NULL UNIQUE | e.g., `example.com` |
| protocol | TEXT NOT NULL | `https` or `http` (prefer https) |
| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., `CC-MAIN-2026-05`) |
| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
| warc_record_length | INT NOT NULL | Length of WARC record |
| html_title | TEXT | Extracted from `<title>` tag |
| iframe_allowed | BOOLEAN | True if site allows framing |
| best_icon_s3_key | TEXT | SHA-256 hash of the chosen icon file (denormalized for fast bundle gen) |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |
| random_order | DOUBLE PRECISION DEFAULT random() | Random value for shuffled bundle generation pagination |

### `icons` table

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PRIMARY KEY | Internal ID |
| host_id | INT REFERENCES hosts(id) | FK to parent host |
| url | TEXT NOT NULL | Full URL to the icon |
| source | TEXT NOT NULL | `favicon_ico` or `link_rel` |
| rel_type | TEXT | MIME type from HTML attribute (if specified) |
| rel_sizes | TEXT | Sizes attribute from HTML (if specified) |
| content_type | TEXT | Actual MIME type after download |
| width | INT | Best usable pixel width (for ICO: largest standard size ≤64; for SVG: NULL) |
| height | INT | Best usable pixel height (for ICO: largest standard size ≤64; for SVG: NULL) |
| file_size | INT | Size in bytes |
| s3_key | TEXT | SHA-256 hash of content (used as local file path, legacy column name) |
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
| error | TEXT | Error message if failed |
| downloaded_at | TIMESTAMPTZ | When the icon was fetched (NULL if not yet downloaded) |

**Indexes:**
- `CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned'` — partial index for work claiming. Only indexes unscanned rows; shrinks as work completes. Minimal write overhead since index only updates on transition OUT of 'unscanned'.
- `idx_icons_host_id` on (host_id) — for best-icon selection query

**Content-Addressed Storage:** SHA-256 hash of the downloaded icon content, used as the local file path (`ab/cd/ef/{full_hash}`). This gives free dedup — if two sites serve the exact same favicon bytes, we store it once. Before writing, check if the file exists; if so, skip the write but still record the hash in the icons table.

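The content-addressed layout can be sketched as below. This is an illustration in Python rather than the Go downloader itself. The function names `shard_path` and `store_icon` are hypothetical, and it assumes the hash is written as lowercase hex.

```python
import hashlib
from pathlib import Path


def shard_path(icons_dir: Path, digest: str) -> Path:
    """Map a hex SHA-256 digest to icons_dir/ab/cd/ef/<full digest>."""
    return icons_dir / digest[0:2] / digest[2:4] / digest[4:6] / digest


def store_icon(icons_dir: Path, content: bytes) -> tuple[str, bool]:
    """Store content under its own hash.

    Returns (digest, written). written is False on a dedup hit, but the
    digest is returned either way so the caller can record it in `icons`.
    """
    digest = hashlib.sha256(content).hexdigest()
    path = shard_path(icons_dir, digest)
    if path.exists():
        return digest, False  # identical bytes already on disk
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return digest, True
```

Two hosts serving identical favicon bytes produce the same digest, so the second call returns `written=False` and the file is stored only once.
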
### Bundle JSON format (`tabs/{n}.json`)

```json
{
  "entries": [
    {
      "url": "https://example.com",
      "title": "Example Domain",
      "icon": "iVBORw0KGgo...",
      "icon_w": 32,
      "icon_h": 32,
      "iframe_ok": true
    },
    {
      "url": "http://no-favicon-site.org",
      "title": "A Site Without Favicon",
      "icon": "",
      "iframe_ok": false
    }
  ]
}
```

Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with `"icon": ""`) as long as they have a title. CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data.

Bundle size is parameterized (`ENTRIES_PER_BUNDLE`, default 120). Tuned to fill a viewport plus scroll buffer. Average bundle size ~215KB uncompressed, significantly smaller after Brotli.

## Pipeline Stages

The pipeline is a series of manually-run scripts executed in order on the single EC2 instance. Each stage is idempotent and resumable.

### Stage 1: CC-Index Query

**Tool:** DuckDB with `aws` extension (credential chain) to read parquet directly from S3

**Input:** Common Crawl columnar index (parquet files on `s3://commoncrawl/cc-index/...`)

**Query logic:**
```sql
WHERE url_path = '/'
  AND content_mime_type = 'text/html'
  AND fetch_status = 200
  AND url_query IS NULL
  AND url_protocol IN ('http', 'https')
  AND url_port IS NULL
```

**Deduplication:** Per hostname, prefer `https` over `http`. Result is one row per unique hostname.

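The deduplication rule is small enough to show directly. The sketch below expresses it in Python for illustration only. The real stage does this inside the DuckDB query, and the function name `dedupe_hosts` is hypothetical.

```python
def dedupe_hosts(rows):
    """Keep one protocol per hostname, preferring https over http.

    rows is an iterable of (hostname, protocol) pairs in any order.
    Returns a dict mapping hostname -> chosen protocol.
    """
    best = {}
    for hostname, protocol in rows:
        current = best.get(hostname)
        # First sighting wins unless a later https row upgrades an http one.
        if current is None or (protocol == "https" and current != "https"):
            best[hostname] = protocol
    return best


rows = [("a.com", "http"), ("a.com", "https"), ("b.org", "http")]
print(dedupe_hosts(rows))
```

A host seen over both protocols collapses to a single `https` row, while an http-only host is kept as `http`.
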
**Output:** Populates `hosts` table in RDS (~30M rows for a full crawl).

**Cost:** $0 — Common Crawl is part of the AWS Open Data Registry. S3 GET requests and data transfer within us-east-1 are free.

**Stats emitted:** Total domains found, https vs http breakdown, duplicates removed.

### Stage 2: WARC Parsing

**Tool:** Custom Go program, highly concurrent

**Input:** `hosts` table rows where `parsed = FALSE`

**Process:**
1. Read batches of unparsed rows (cursor-based pagination by ID)
2. For each row, make a byte-range S3 GetObject request to the `commoncrawl` bucket:
   - `Range: bytes={offset}-{offset+length-1}`
   - Uses AWS SDK (not `data.commoncrawl.org` HTTPS endpoint, which rate-limits at ~100 concurrent connections)
3. Parse the WARC record to extract the HTTP response
4. From HTTP response headers: check for `X-Frame-Options` and `Content-Security-Policy` frame-ancestors
5. Parse HTML defensively (lenient parser, handle malformed HTML):
   - Extract `<title>` tag content
   - Extract ALL `<link rel="icon">` / `<link rel="shortcut icon">` entries with their href, type, and sizes attributes
6. Insert a `/favicon.ico` entry into `icons` for every host (protocol://hostname/favicon.ico)
7. Insert all discovered `link rel="icon"` entries into `icons` (any format: ICO, PNG, GIF, SVG, WebP, JPEG)
8. Update `hosts` row: html_title, iframe_allowed, parsed = TRUE

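Step 4 leaves open how the two headers map to `iframe_allowed`. One conservative reading is sketched below in Python (the real parser is Go). The rule is an assumption, not the documented behavior: any `X-Frame-Options` value counts as a restriction, and a `frame-ancestors` directive counts as a restriction unless it lists `*`. The function name is hypothetical.

```python
def iframe_allowed(headers: dict[str, str]) -> bool:
    """Return False when response headers restrict framing by other origins."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}

    # DENY, SAMEORIGIN and ALLOW-FROM all block an arbitrary third-party frame.
    if h.get("x-frame-options", "").strip():
        return False

    # CSP is a ';'-separated list of directives: "name source source ...".
    for directive in h.get("content-security-policy", "").split(";"):
        parts = directive.strip().split()
        if parts and parts[0].lower() == "frame-ancestors":
            return "*" in parts[1:]

    return True
```

Erring toward `False` is the safer failure mode here: a tab wrongly marked as restricted still opens in a new browser tab, whereas a tab wrongly marked as frameable would show a blank viewer.
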
**Architecture:** Three-stage pipeline:

```
[DB fetcher] → hostCh → [500 workers] → resultCh → [DB writer with pgx.Batch]
```

1. **DB fetcher** (1 goroutine): continuously pages through unparsed hosts (batch size 5000), feeds `hostCh`.
2. **Workers** (500 goroutines, configurable): fetch WARC from S3, parse HTML, update stats, send successful results to `resultCh`. I/O-bound on S3 latency. S3 fetches are retried up to 6 times with exponential backoff to ride out transient 503s.
3. **DB writer** (1 goroutine): collects results, flushes every 100 using `pgx.Batch` (~400 queries per DB round-trip).

**Error handling:** Malformed HTML → still extract what we can (partial title, partial icons). WARC fetch failure → log and skip (host stays `parsed = FALSE`, retryable on next run). Max 50 link_rel icons per host (defensive cap against adversarial pages).

**Icon URL handling:** Relative URLs resolved against `{protocol}://{hostname}/`. Absolute URLs kept as-is. Data URIs ignored.

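The URL handling rules map closely onto standard URL joining. The sketch below uses Python's `urllib.parse.urljoin` for illustration (the real parser is Go). The function name `resolve_icon_url` and the choice to return `None` for ignored hrefs are assumptions.

```python
from urllib.parse import urljoin


def resolve_icon_url(protocol: str, hostname: str, href: str):
    """Resolve a <link rel="icon"> href against the host's root URL.

    Returns an absolute URL, or None for empty hrefs and data: URIs.
    """
    href = href.strip()
    if not href or href.lower().startswith("data:"):
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return urljoin(f"{protocol}://{hostname}/", href)


print(resolve_icon_url("https", "example.com", "img/fav.png"))
print(resolve_icon_url("https", "example.com", "https://cdn.example.net/i.png"))
```

Protocol-relative hrefs such as `//cdn.example.net/i.png` inherit the host's protocol under this approach.
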
**Cost:** $0 (same Open Data program).

**Stats emitted:** Rows processed, titles extracted, no-title count, icons found, iframe restrictions, fetch/parse errors, DB errors, panics.

### Stage 3: Icon Download

**Tool:** Custom Go program, highly concurrent

**Prerequisite:** Unbound running as system resolver on the EC2 instance.

**Input:** ALL `icons` table rows where `scan_state = 'unscanned'` — no size filter. Every `favicon_ico` and `link_rel` icon is downloaded regardless of declared size. The full archive is kept on disk; filtering happens later at best-icon selection and bundle generation.

**Process:**
1. Producer goroutine claims batches via `FOR UPDATE SKIP LOCKED`:
   ```sql
   UPDATE icons SET scan_state = 'in_progress'
   WHERE id IN (
     SELECT id FROM icons
     WHERE scan_state = 'unscanned'
     LIMIT 5000
     FOR UPDATE SKIP LOCKED
   ) RETURNING id, url;
   ```
   Icons are fed into a buffered channel. N worker goroutines consume from the channel, so workers never starve between batch claims.
2. For each icon URL:
   - Make HTTP(S) GET request (standard Go HTTP client — DNS transparently goes through Unbound)
   - Shared `http.Transport` for connection pooling and TLS session reuse
   - Enforce timeouts: 5s connect, 10s total
   - Enforce max download size: 512KB (generous for icons, but prevents abuse)
   - On success:
     - Validate magic bytes (is this actually an image?)
     - Decode to get dimensions:
       - PNG/GIF/WebP/JPEG/BMP: read image headers for width/height
       - ICO: parse ICO header, find largest embedded size ≤64x64 at a standard dimension (16/32/48/64), store THAT size in width/height
       - SVG: store width=NULL, height=NULL (vector, no pixel size)
     - Compute SHA-256 of content
     - Write to local disk at `{icons_dir}/ab/cd/ef/{sha256}` (skip if file already exists — dedup)
     - Update icons row: s3_key (the SHA-256 hash), content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
   - On failure: scan_state = 'failed', error = reason

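The ICO rule in the process above ("largest embedded size ≤64x64 at a standard dimension") can be sketched by reading the ICO directory. The layout is a 6-byte header (reserved, type, count) followed by one 16-byte entry per embedded image, whose first two bytes are width and height. This is a Python illustration, not the Go implementation. The name `best_ico_size` and the requirement that entries be square are assumptions.

```python
import struct

STANDARD_SIZES = (16, 32, 48, 64)


def best_ico_size(data: bytes):
    """Return (width, height) of the largest standard entry <= 64px, or None."""
    if len(data) < 6:
        return None
    reserved, kind, count = struct.unpack_from("<HHH", data, 0)
    if reserved != 0 or kind != 1:  # type 1 = icon (2 would be a cursor)
        return None

    best = None
    for i in range(count):
        offset = 6 + 16 * i
        if offset + 16 > len(data):
            break  # truncated directory: keep what was found so far
        width, height = struct.unpack_from("<BB", data, offset)
        # A stored 0 means 256 pixels in the ICO directory format.
        width, height = width or 256, height or 256
        if width == height and width in STANDARD_SIZES:
            if best is None or width > best[0]:
                best = (width, height)
    return best
```

A file containing 16x16, 48x48 and 256x256 images would be recorded as 48x48 under this rule. A file containing only a 256x256 image yields `None`, leaving width and height unset.
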
**Concurrency:** Channel-based worker pool (default 2500 workers, configurable). Producer goroutine feeds a buffered channel (buffer = batch size), shuffles each batch to avoid hitting the same host back-to-back. N workers consume from the channel.

**Fast failure strategy:**
- DNS failure → fail immediately (Unbound will cache NXDOMAIN)
- Connection refused → fail immediately
- Timeout → fail after deadline (no retry)
- Too large → abort read at 512KB boundary
- Not an image → fail (record content-type in error)

**Permissive on format:** Download everything — ICO, PNG, GIF, SVG, WebP, JPEG, BMP, whatever the server returns. Store the raw bytes on disk. Format filtering and conversion happens later in bundle generation.

**Scaling to fleet (if needed):**
- Multiple EC2 instances run the same binary
- Each claims work via Postgres row-level locking (`FOR UPDATE SKIP LOCKED`)
- No coordinator needed — linear scaling with instance count

**Stats emitted:** Icons attempted, completed, failed (breakdown by error type: DNS, timeout, connection refused, HTTP 4xx, HTTP 5xx, invalid image, too large), icons/sec rate, bytes downloaded, dedup hits.

### Stage 4: Best Icon Selection

**Tool:** SQL script

**Process:** For each host, select the best icon from all its completed downloads.

**Selection priority (decision flow):**

Target: 32x32 source icon. The frontend displays favicons at 16x16 CSS pixels, which is 32x32 physical pixels on 2x Retina screens. So 32x32 is the ideal source resolution — crisp on Retina without wasting bundle space.

1. **Icons ≥32px** (preferred): smallest first, so closest to 32 wins. A 32x32 beats a 48x48 beats a 180x180.
2. **Icons <32px** (fallback): largest first. A 16x16 beats an 8x8.
3. **Unknown dimensions** (NULL width/height): last resort.

Within the same size tier:
- Prefer PNG > ICO > GIF/JPEG/BMP > WebP
- Tiebreaker: smaller file size

SVGs excluded (can't rasterize without external deps). Icons ≤2x2 excluded (tracking pixels).

Does not distinguish between `favicon_ico` and `link_rel` sources — purely based on what was actually downloaded and its dimensions/format.

Uses `DISTINCT ON (host_id)` for efficient single-pass selection. See `pipeline/04_best_icon/select.sql`.

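The priority rules above amount to a sort key. The sketch below restates them in Python for illustration. The actual selection is the SQL in `select.sql`, which is not reproduced here. The MIME strings in `FORMAT_RANK`, the use of width as the size measure, and the reading of "same size tier" as "same pixel size" are all assumptions.

```python
# Lower rank is preferred: PNG > ICO > GIF/JPEG/BMP > WebP.
FORMAT_RANK = {
    "image/png": 0,
    "image/x-icon": 1,
    "image/vnd.microsoft.icon": 1,
    "image/gif": 2,
    "image/jpeg": 2,
    "image/bmp": 2,
    "image/webp": 3,
}


def eligible(icon: dict) -> bool:
    """Drop SVGs and tiny (<= 2x2) tracking-pixel images."""
    if icon["content_type"] == "image/svg+xml":
        return False
    w, h = icon["width"], icon["height"]
    return not (w is not None and h is not None and w <= 2 and h <= 2)


def sort_key(icon: dict):
    """Smaller key = better icon."""
    w = icon["width"]
    if w is None or icon["height"] is None:
        tier, size = 2, 0   # unknown dimensions: last resort
    elif w >= 32:
        tier, size = 0, w   # smallest first, so closest to 32 wins
    else:
        tier, size = 1, -w  # largest first among small icons
    rank = FORMAT_RANK.get(icon["content_type"], 4)
    return (tier, size, rank, icon["file_size"])


def best_icon(icons: list):
    candidates = [i for i in icons if eligible(i)]
    return min(candidates, key=sort_key) if candidates else None
```

With this key, a 32x32 GIF beats a 48x48 ICO because size is compared before format. Format and file size only break ties between icons of the same pixel size.
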
**Stats emitted:** Hosts with icons selected, hosts without any icon.

### Stage 5: Bundle Generation

**Tool:** Custom Go program (multi-threaded for image processing)

**Input:** All hosts where `html_title IS NOT NULL` (include hosts without icons)

**Architecture:** Four-stage pipeline with all stages running concurrently:

```
[DB fetcher] → hostCh → [N converters] → entryCh → [bundle assembler] → uploadCh → [M uploaders]
```

1. **DB fetcher** (1 goroutine): continuously fetches pages of hosts via keyset pagination on `random_order`. Feeds hosts into `hostCh`. Never waits for downstream stages.
2. **Converter workers** (N goroutines, default 20): read hosts from `hostCh`, read icon from disk, decode, re-encode as PNG, base64-encode, emit `BundleEntry` to `entryCh`. CPU-bound — default tuned to ~5x core count on c5.xlarge (4 vCPUs).
   - Decode via Go's `image.Decode` (handles PNG, GIF, JPEG, WebP, BMP, ICO via registered decoders)
   - SVGs excluded (no rasterizer) — these hosts appear without icons
   - Icons >128px downscaled to 32x32 (nearest-neighbor). Icons ≤128px kept as-is.
3. **Bundle assembler** (1 goroutine): collects entries from `entryCh`. Every 120 entries (configurable), serializes as JSON and sends to `uploadCh`. Hosts without icons included with `"icon": ""`.
4. **Upload workers** (M goroutines, default 10): write bundles to S3 (or local disk in dry-run mode). I/O-bound — multiple uploads in flight hides S3 PUT latency (~50-100ms each).

Bundles are written in-place (overwriting previous run). No delete-first step, so the live site always has valid data even if bundle gen crashes midway. The frontend's `TOTAL_BUNDLES` constant ensures only valid bundle indices are requested.

**Output:**
- `tabs/0000.json` through `tabs/{M-1}.json` in S3 `everytab-site` (indices are zero-based, matching the frontend's `[0, TOTAL_BUNDLES)` range)
- Total bundle count M (bake into frontend via deploy script)

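The assembler's chunking and naming can be sketched as below, in Python for illustration (the real assembler is a Go goroutine fed by a channel). The zero-padded `{n:04d}` key format is inferred from `tabs/0000.json`, and emitting a final partial bundle is an assumption. `build_bundles` is a hypothetical name.

```python
import json

ENTRIES_PER_BUNDLE = 120  # default bundle size from the Data Model section


def build_bundles(entries: list, per_bundle: int = ENTRIES_PER_BUNDLE):
    """Yield (s3_key, json_body) pairs, one per bundle of up to per_bundle entries."""
    for n, start in enumerate(range(0, len(entries), per_bundle)):
        chunk = entries[start:start + per_bundle]
        body = json.dumps({"entries": chunk}, separators=(",", ":"))
        yield f"tabs/{n:04d}.json", body


# Hypothetical input: 250 title-only entries produce 3 bundles (120 + 120 + 10).
entries = [
    {"url": f"https://site{i}.example", "title": f"Site {i}", "icon": "", "iframe_ok": False}
    for i in range(250)
]
bundles = list(build_bundles(entries))
print(len(bundles), bundles[0][0], bundles[-1][0])
```

The number of pairs yielded is the bundle count M that the deploy step bakes into the frontend as `TOTAL_BUNDLES`.
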
**Stats emitted:** Total bundles created, total hosts included (with icon / without icon), average bundle size (bytes), total S3 storage used, icon conversion failures.

### Stage 6: Frontend Deploy

**Tool:** `pipeline/06_frontend/deploy.sh`

**Process:**
1. `sed` injects `const TOTAL_BUNDLES = {M};` into a temp copy of `index.html`
2. Uploads `index.html`, `site.js`, `bot.html`, `about.html` to S3 `everytab-site`
3. Invalidates CloudFront cache for all four files (auto-detects distribution ID)

### Stage 7: Backup & Teardown

**Process (manual, with confirmation at each step):**
1. Dump RDS database: `pg_dump -Fc` → transfer to homelab via rsync
2. Sync icons from local disk: `rsync -avP ~/icons/ homelab:/backups/everytab/icons/`
3. **Verify backups:** confirm pg_dump restores cleanly on homelab, spot-check icon files
4. Tear down scanning infra: `terraform apply -var="scanning=false"` (deletes RDS, EC2, icons S3 bucket)

## Performance Characteristics

Each pipeline stage has different bottlenecks. Understanding these explains the concurrency choices and why certain stages can't be sped up further on a single machine.

### Stage 1: CC-Index Query
- **Download phase: network-bound.** `aws s3 sync` of ~166GB of parquet files. Throughput limited by EC2 network bandwidth (up to 10 Gbps on c5.2xlarge). Takes ~10-15 minutes.
- **Query phase: memory-bound.** DuckDB loads the GROUP BY hash table into memory. At 30M output rows, the hash table approaches 16GB. `temp_directory` is set to EBS so DuckDB spills to NVMe efficiently (large sequential I/O) rather than relying on OS swap (random 4KB page faults). On c5.2xlarge (16GB RAM) with 8GB swap, the query completes without severe thrashing.
- **Not CPU-bound** — DuckDB's columnar scan is efficient, CPU cores are underutilized during the query.

### Stage 2: WARC Parsing
- **CPU-bound + network I/O-bound (S3).** Each WARC fetch is a byte-range S3 GetObject request (~100-200ms round-trip), but TLS handshakes + gzip decompression + HTML parsing consume significant CPU. At 500 goroutines on 4 cores, CPU was at 100%. On c5.2xlarge (8 cores), more workers can actually compute simultaneously.
- **DB writes batched** via `pgx.Batch` — 500 results (~2000 queries) per round-trip. Non-burstable RDS (db.m5.large) provides consistent write performance. Burstable t3 instances throttle under sustained load and cause pipeline stalls via channel back-pressure.
- **Channel buffers sized to prevent stalls** — hostCh (20K) gives the DB fetcher enough runway between queries. resultCh (1K) absorbs write latency spikes.
- **S3 retry** with 6 attempts and exponential backoff handles transient 503s from the `commoncrawl` bucket.
- **Measured: 566 hosts/sec** at concurrency 500 on c5.xlarge (4 cores). Expected ~1000+ hosts/sec on c5.2xlarge (8 cores).

### Stage 3: Icon Download
- **Network I/O-bound (internet).** Downloading from millions of different web servers worldwide. Latency varies wildly (1ms to 10s). The long tail of slow/dead servers dominates — most icons download in <500ms but timeouts (10s) hold workers.
- **The long pole of the pipeline** — longest stage at 30M scale.
- **5000 concurrent goroutines** to keep throughput high despite variable latency. Not CPU-bound (magic byte checks and SHA-256 are fast). Not DB-bound (one write per icon at ~1ms, self-smoothing due to random server latencies).
- **Memory is the concurrency limit** — each goroutine holds a TCP connection + TLS session + icon data buffer. At 5000 workers on c5.2xlarge (16GB), ~2-3GB for connection overhead — comfortable.
- **Disk I/O is negligible** — icons are small (median ~5KB), writes are sharded across directories.
- **DNS is cached** — Unbound's aggressive caching (1.7GB cache, 3600s min-TTL) means repeat TLD/nameserver lookups are instant. First-seen domains incur recursive resolution (~50-100ms) but this is pipelined with the HTTP request.
- **Measured: 439 icons/sec** at concurrency 1000 on c5.xlarge. Expected to improve significantly at 5000 concurrency on c5.2xlarge.

### Stage 4: Best Icon Selection
- **CPU-bound (Postgres).** Single SQL query with `DISTINCT ON` and multi-column sort. Runs in seconds even at 30M — Postgres handles this efficiently with the `idx_icons_host_id` index.

### Stage 5: Bundle Generation
- **CPU-bound (image conversion).** Decoding icons (especially ICO) and re-encoding as PNG is the bottleneck. 40 converter goroutines on c5.2xlarge (8 cores) keep all cores saturated. More goroutines don't help — they just compete for cores.
- **Disk I/O is secondary** — reading small icon files from the sharded directory. Usually cached in the OS page cache after first access.
- **S3 uploads are pipelined** — 10 upload workers hide the ~50-100ms PUT latency. The assembler serializes bundles while previous uploads are in flight.
- **DB reads are pipelined** — the fetcher goroutine prefetches pages while converters work, so workers never wait for DB.
- **Measured: 2,377 hosts/sec** at concurrency 20 on c5.xlarge (4 cores). Expected ~4500+ hosts/sec at concurrency 40 on c5.2xlarge.

### Stage 6: Frontend Deploy
- **Network-bound.** 4 small file uploads to S3 + CloudFront invalidation. Seconds.

### Summary: what would make each stage faster

| Stage | Current bottleneck | To speed up further |
|-------|-------------------|-------------|
| CC-Index | Memory (DuckDB hash table spill) | Streaming dedup via INSERT ON CONFLICT, or more RAM |
| WARC parsing | CPU + S3 latency | More cores, or multiple EC2 instances |
| Icon download | Internet latency (slow/dead servers) | Multiple EC2 instances |
| Bundle gen | CPU (image decode/encode) | More cores, or better image libraries |
| Deploy | N/A | Already seconds |

## DNS Architecture

**Unbound** runs on the EC2 instance as the system DNS resolver.

**Configuration:**
- Recursive resolver mode (no forwarding to any upstream — resolves from root servers)
- Listening on 127.0.0.1:53
- Set as system resolver in `/etc/resolv.conf`
- Aggressive caching enabled
- High min-TTL (3600s) — maximizes cache hits for TLD/popular nameservers
- High cache size (allocate 1-2GB RAM to Unbound)
- Prefetch enabled (refresh popular entries before expiry)

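A configuration along these lines might look like the `unbound.conf` fragment below. It is a sketch, not the deployed file. The cache sizes are illustrative values chosen to land in the 1-2GB range, and `aggressive-nsec` is only one possible reading of "aggressive caching".

```
server:
    # Listen only on loopback; /etc/resolv.conf points at 127.0.0.1.
    interface: 127.0.0.1
    port: 53
    access-control: 127.0.0.0/8 allow

    # No forward-zone block is defined, so Unbound resolves recursively
    # from the root servers instead of forwarding to an upstream.

    # Keep every answer for at least an hour.
    cache-min-ttl: 3600

    # Illustrative sizes (assumption): about 1.5GB across both caches.
    msg-cache-size: 512m
    rrset-cache-size: 1024m

    # Refresh popular entries shortly before they expire.
    prefetch: yes

    # Assumption: synthesize negative answers from cached NSEC records.
    aggressive-nsec: yes
```

Raising `cache-min-ttl` this far means stale records can be served for up to an hour, which is acceptable for a one-shot scan but would be a poor default for a general-purpose resolver.
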
**Why recursive instead of forwarding:** Forwarding to Google/Cloudflare would get us rate-limited at 30M+ lookups. Recursive resolution distributes load across thousands of authoritative nameservers. With caching, the actual external query volume is much lower than 30M (most domains share TLD nameservers, many share CDN nameservers).

**Transparent to Go:** The Go HTTP client uses the OS resolver, which uses Unbound. No custom transport, no SNI issues, no pre-resolved IPs needed. Standard HTTPS connections with normal hostname verification.

## Frontend Architecture

### File Structure
- `index.html` — minimal HTML shell, inline CSS
- `site.js` — tab rendering logic, bundle fetching, interaction (separate file for cleanliness, cached after first load)

### Requests Per Visit
1. `GET /index.html` — HTML + CSS (<10KB)
2. `GET /site.js` — JavaScript (cached indefinitely via content hash in filename or cache headers)
3. `GET /tabs/{random}.json` — first bundle (~150-300KB, Brotli-compressed to ~100-200KB)

Subsequent scrolls: one additional `/tabs/{n}.json` per "page" of tabs.

### Tab Rendering

- Rows of tabs fill the viewport, styled to match the visitor's browser (Chrome, Firefox, Safari — detected via `navigator.userAgent`)
- Each row has a bidirectional marquee animation at varying speeds (90-150s per cycle), with random stagger to avoid synchronization
- Tabs duplicated in DOM for seamless marquee loop (`translateX(-50%)`)
- Each tab shows: favicon (rendered via `<img src="data:image/png;base64,...">`) + truncated title
- No-icon tabs: just title text, no icon
- Light mode default, auto-switches to dark mode via `prefers-color-scheme`
- Hover shows full title as native tooltip

### Interaction

- **Click tab (iframe_ok=true):** Opens an inline iframe viewer between tab rows (75vh height, pushes content down)
- **Click tab (iframe_ok=false):** Opens site in a new tab (with `↗` external-link indicator on the tab)
- **Close viewer:** X button or Escape key. Only one viewer open at a time.
- **Scroll down:** When approaching the bottom, fetch next random bundle and render more rows

### Randomization

- Seed: `Date.now()` (milliseconds UTC) — every visitor at a different moment sees different tabs
- PRNG: seeded random number generator (e.g., mulberry32 or xoshiro) for deterministic sequence from seed
- Generate random bundle indices in range `[0, TOTAL_BUNDLES)`
- Track fetched bundle IDs in a `Set` to avoid loading duplicates on continued scroll

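The randomization scheme can be sketched as a seeded generator that never repeats a bundle. The sketch is in Python for illustration, since the frontend itself is JavaScript. Python's `random.Random` stands in for mulberry32/xoshiro, and `bundle_picker` is a hypothetical name.

```python
import random
import time


def bundle_picker(total_bundles: int, seed: int):
    """Yield each bundle index in [0, total_bundles) at most once.

    The order is random but fully determined by the seed.
    """
    rng = random.Random(seed)
    seen = set()  # plays the role of the frontend's Set of fetched IDs
    while len(seen) < total_bundles:
        n = rng.randrange(total_bundles)
        if n not in seen:
            seen.add(n)
            yield n


# Millisecond timestamp as the seed, mirroring Date.now() in the browser.
picker = bundle_picker(total_bundles=250_000, seed=int(time.time() * 1000))
first_bundle = next(picker)  # index used for GET /tabs/{n}.json
```

Rejection sampling against the `seen` set is cheap while only a small fraction of bundles has been fetched, which is the expected case for a single visit.
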
### Future Enhancements
- Mobile-optimized layout
- "Search for a site" feature
- Stats page (how many sites, coverage, etc.)
- Performance: IntersectionObserver to pause off-screen marquee rows

## Statistics & Metadata

Each pipeline stage emits a JSON stats file:

```
stats/
  01_cc_index.json
  02_warc_parse.json
  03_icon_download.json
  04_best_icon.json
  05_bundle_gen.json
```

After bundle generation, these are merged into a single `stats.json` uploaded to `everytab-site`:

```json
{
  "crawl_id": "CC-MAIN-2026-05",
  "generated_at": "2026-05-17T12:00:00Z",
  "pipeline": {
    "cc_index": {
      "started_at": "2026-05-17T08:00:00Z",
      "finished_at": "2026-05-17T08:42:00Z",
      "duration_seconds": 2520,
      "total_domains": 31245678,
      "https": 28901234,
      "http_only": 2344444,
      "duplicates_removed": 1456789
    },
    "warc_parse": {
      "started_at": "2026-05-17T08:45:00Z",
      "finished_at": "2026-05-17T12:15:00Z",
      "duration_seconds": 12600,
      "processed": 31245678,
      "titles_extracted": 29876543,
      "icons_found": 45678901,
      "iframe_restricted": 12345678,
      "parse_failures": 234567
    },
    "icon_download": {
      "started_at": "2026-05-17T12:20:00Z",
      "finished_at": "2026-05-18T18:30:00Z",
      "duration_seconds": 108600,
      "attempted": 45678901,
      "completed": 38901234,
      "failed_dns": 2345678,
      "failed_timeout": 1234567,
      "failed_http_error": 1567890,
      "failed_invalid_image": 890123,
      "failed_too_large": 12345,
      "unique_icons_stored": 34567890,
      "dedup_hits": 4333344
    },
    "best_icon": {
      "started_at": "2026-05-18T18:35:00Z",
      "finished_at": "2026-05-18T18:40:00Z",
      "duration_seconds": 300,
      "hosts_with_icon": 27654321,
      "hosts_without_icon": 3591357
    },
    "bundles": {
      "started_at": "2026-05-18T18:45:00Z",
      "finished_at": "2026-05-18T20:10:00Z",
      "duration_seconds": 5100,
      "total_bundles": 52341,
      "total_hosts_included": 29876543,
      "hosts_with_icon": 27654321,
      "hosts_without_icon": 2222222,
      "excluded_no_title": 1369135,
      "avg_bundle_size_bytes": 245000
    }
  }
}
```

This is served publicly at `/stats.json` on the live site — interesting metadata for visitors and useful for monitoring pipeline health across crawls.

## Cost Estimate

### Scanning Phase (One-Time per Crawl)

| Item | Estimate |
|------|----------|
| EC2 c5.2xlarge (~2-3 days) | $16-24 |
| EBS 1TB gp3 (~3 days) | $8 |
| RDS db.m5.large (~3 days) | $12-15 |
| Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) |
| Data transfer (icon downloads from internet, inbound) | $0 (inbound free) |
| Data transfer (backup to homelab, outbound) | $5-59 (depends on icon archive size) |
| **Total** | **~$41-106** |

### Hosting Phase (Monthly Steady-State)

| Item | Estimate |
|------|----------|
| S3 everytab-site storage (~10-15GB of bundles) | $0.35 |
| CloudFront (free tier: 1TB/month transfer, 10M requests/month) | $0 |
| S3 origin requests via CloudFront (heavily cached) | $1-3 |
| **Total** | **~$2-4/month** |

Note: Bundle storage estimate revised down. With ~50K bundles at ~250KB each = ~12.5GB, well under previous estimate since we're targeting viewport-fill (100-150 tabs) not 1MB bundles.

If the site gets significant traffic beyond CloudFront free tier, costs scale with usage — but that's a success problem.

## Scaling Strategy

### Development Phase (100K domains)
- Cap CC-Index query to 100K rows
- Full pipeline runs in minutes
- Validates end-to-end correctness
- Frontend development and tab-density tuning

### Full Scan (30M domains)
- Single EC2 instance, high concurrency
- CC-Index query: <1hr (httpfs) or ~2hrs (download + local query)
- WARC parsing: 2-6hrs
- Icon download: 12-48hrs (the long pole)
- Bundle generation: 1-2hrs
- Total: ~1-2 days

### Fleet Scaling (if single instance is too slow)
- Spin up N identical EC2 instances running the icon downloader
- All connect to the same RDS instance
- Work claiming via `FOR UPDATE SKIP LOCKED` — no double work, no coordinator
- Linear throughput scaling: 4 instances ≈ 4x download speed
- Only the icon download stage benefits from fleet (other stages are fast enough solo)

## Key Design Decisions

1. **Static-only hosting** — No servers for the live site. Everything pre-built. Minimal attack surface, minimal cost.
2. **Inline icons in bundles** — One fetch gives you 100+ tabs to render. No per-icon requests.
3. **Base64 + Brotli** — Base64 for browser-native decoding (`atob()`). Brotli compression at the CDN layer reduces transfer size by ~25-30% for free.
4. **Unbound as system resolver** — Transparent to application code. Standard Go HTTP. No custom networking.
5. **SHA-256 content-addressed icon storage** — Natural dedup on local disk. Same favicon stored once even if referenced by multiple hosts.
6. **Permissive download, selective bundling** — Download ALL favicon formats and sizes during scanning. Convert to optimized PNG only during bundle generation. Decouples "capture as much as possible" from "serve the best version."
7. **Partial index for work claiming** — Indexes only unscanned rows. Shrinks as work progresses. Minimal write amplification.
8. **Local disk for icons, S3 for site** — Icons stored on EBS during scanning (avoids ~$175 in S3 PUT costs at 30M scale). Only the static site lives in S3 behind CloudFront.
9. **Per-millisecond random seed** — Every visitor sees a unique arrangement. No shared state, no server needed for randomization.
10. **Viewport-sized bundles** — ~100-150 tabs per bundle, tuned to fill a screen. Faster loads, smaller memory footprint than 1MB bundles.
11. **Include no-icon hosts** — Any host with a title is included. Firefox-style rendering (title only) for hosts without favicons.
12. **Denormalized best_icon_s3_key in hosts** — Stores the SHA-256 hash of the chosen icon. Avoids joins during bundle generation. Written once during icon selection, read once during bundling.