# EveryTab Architecture

## System Overview

EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:

1. **Scanning Phase**: A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
2. **Hosting Phase**: A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.

The scanning phase runs monthly (triggered by new Common Crawl releases), produces a static site, and then its infrastructure is torn down. The hosting phase runs indefinitely at minimal cost.

```
Common Crawl (S3)
       |
       v
[EC2 + DuckDB] ---> [RDS Postgres] ---> [EC2 + Go programs] ---> S3 (icons/)
       |                  |                      |                    |
       |            (hosts, icons                |                    |
       |               tables)                   v                    v
       |                  |             [Bundle Generator] ---> S3 (tabs/*.json)
       |                  |                                           |
       |                  v                                           v
       |         [Backup to homelab]                           S3 (index.html)
       |                                                              |
       v                                                              v
[Tear down EC2, RDS]                                          [CloudFront CDN]
```

## AWS Infrastructure

All resources in **us-east-1**.

| Resource | Purpose | Lifecycle |
|----------|---------|-----------|
| EC2 (xlarge, compute-optimized) | Run pipeline stages | Scanning only |
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup then delete) |
| S3 `everytab-icons` | Raw downloaded favicons | Scanning only (backup then delete) |
| S3 `everytab-site` | Static site: index.html + tabs/*.json | Permanent |
| CloudFront | CDN for static site | Permanent |
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only |

### Steady-State (Hosting Only)

- S3 `everytab-site`: stores index.html + ~50K JSON bundle files (~60GB total)
- CloudFront distribution: serves the site with caching

### Scanning Phase (Temporary)

- EC2 instance: runs all processing (no persistent local storage needed beyond OS)
- RDS: structured data store during pipeline execution
- S3 `everytab-icons`: temporary storage for downloaded favicons

## Data Model

### `hosts` table

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PRIMARY KEY | Internal ID |
| hostname | TEXT NOT NULL | e.g., `example.com` |
| protocol | TEXT NOT NULL | `https` or `http` (prefer https) |
| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., `CC-MAIN-2026-05`) |
| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
| warc_record_length | INT NOT NULL | Length of WARC record in bytes |
| html_title | TEXT | Extracted from `<title>` tag |
| iframe_allowed | BOOLEAN | True if site allows framing (no X-Frame-Options/CSP restriction) |
| best_icon_id | INT REFERENCES icons(id) | FK to the chosen icon for bundling |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |

**Constraints:** UNIQUE(hostname): one row per hostname, preferring https over http.

### `icons` table

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PRIMARY KEY | Internal ID |
| host_id | INT REFERENCES hosts(id) | FK to parent host |
| url | TEXT NOT NULL | Full URL to the icon |
| source | TEXT NOT NULL | `favicon_ico` or `link_rel` |
| content_type | TEXT | MIME type after download (image/png, image/x-icon, etc.) |
| width | INT | Decoded pixel width |
| height | INT | Decoded pixel height |
| s3_key | TEXT | Key in everytab-icons bucket |
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
| error | TEXT | Error message if failed |

**Indexes:**

- `idx_icons_scan_state` on (scan_state): for batch claiming work
- `idx_icons_host_id` on (host_id): for best-icon selection

### Bundle JSON format (`tabs/0001.json`)

```json
{
  "entries": [
    {
      "host": "example.com",
      "title": "Example Domain",
      "icon": "iVBORw0KGgo...",
      "icon_w": 32,
      "icon_h": 32,
      "iframe_ok": true
    }
  ]
}
```

Icons are stored inline as base64-encoded PNG. Each bundle targets ~1MB, yielding approximately 500-700 entries per bundle depending on icon sizes.

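As a rough check on the entries-per-bundle figure, the sketch below builds one entry in the format above and measures its serialized size. The pipeline itself is described as Go; Python is used here only to keep the illustration short. The helper names and the 1,200-byte icon payload are assumptions made for the example, not measurements from real data.

```python
import base64
import json

def make_entry(host, title, png_bytes, width, height, iframe_ok):
    """Build one bundle entry; icon is base64 text, or None for no-icon hosts."""
    icon = base64.b64encode(png_bytes).decode("ascii") if png_bytes else None
    return {
        "host": host,
        "title": title,
        "icon": icon,
        "icon_w": width,
        "icon_h": height,
        "iframe_ok": iframe_ok,
    }

def entry_size(entry):
    """Serialized size in bytes of one entry (compact JSON, UTF-8)."""
    return len(json.dumps(entry, separators=(",", ":")).encode("utf-8"))

# Hypothetical payload: the PNG signature plus padding, 1,200 bytes in total,
# standing in for a small 32x32 icon.
fake_png = b"\x89PNG\r\n\x1a\n" + bytes(1192)
entry = make_entry("example.com", "Example Domain", fake_png, 32, 32, True)
size = entry_size(entry)
entries_per_mb = 1_000_000 // size
```

Base64 inflates the icon by a third (1,200 bytes become 1,600 characters), so under this assumption the icon dominates the entry size and the count per 1MB bundle lands inside the 500-700 range quoted above.
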
## Pipeline Stages

The pipeline is a series of manually-run scripts executed in order. Each stage is idempotent and resumable.

### Stage 1: CC-Index Query

**Tool:** DuckDB with httpfs extension (or local parquet if httpfs takes >1hr)

**Input:** Common Crawl columnar index (parquet files on CC's S3)

**Query logic:**

```sql
WHERE url_path = '/'
  AND content_mime_type = 'text/html'
  AND fetch_status = 200
  AND url_query IS NULL
  AND url_protocol IN ('http', 'https')
  AND url_port IN (80, 443)
```

**Deduplication:** Per hostname, prefer `https` over `http`. Result is one row per unique hostname.

**Output:** Populates `hosts` table in RDS (~30M rows for a full crawl).

**Stats emitted:** Total domains found, https vs http breakdown, duplicates removed.

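The deduplication rule of Stage 1 can be sketched in a few lines. This is an in-memory illustration of the rule only; at ~30M rows the real step would more plausibly run inside the DuckDB query itself, and the function name is hypothetical.

```python
def dedupe_hosts(rows):
    """Keep one protocol per hostname, preferring https over http.

    rows: iterable of (hostname, protocol) pairs, in any order.
    """
    best = {}
    for hostname, protocol in rows:
        current = best.get(hostname)
        # Take the first row seen, and upgrade http to https if it shows up later.
        if current is None or (current == "http" and protocol == "https"):
            best[hostname] = protocol
    return best

rows = [
    ("example.com", "http"),
    ("example.com", "https"),
    ("plain.example", "http"),
    ("example.com", "http"),
]
deduped = dedupe_hosts(rows)
duplicates_removed = len(rows) - len(deduped)
```

The `duplicates_removed` count corresponds to one of the stats this stage emits.
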
### Stage 2: WARC Parsing

**Tool:** Custom Go program, highly concurrent

**Input:** `hosts` table rows where `parsed = FALSE`

**Process:**

1. Claim a batch of rows (set `parsed = TRUE` optimistically, or use a cursor)
2. For each row, make a byte-range GET request to Common Crawl's S3:
   - `Range: bytes={offset}-{offset+length-1}`
   - Target: `s3://commoncrawl/{warc_filename}`
3. Parse the WARC record to extract the HTTP response
4. Parse HTML (defensively: handle malformed HTML, use a lenient parser):
   - Extract `<title>` tag content
   - Extract `<link rel="icon">` href values (filter to png/gif/ico, sizes 16-64px)
   - Check HTTP response headers for `X-Frame-Options` and CSP `frame-ancestors`
5. Insert a `/favicon.ico` entry into `icons` for every host (always attempt this)
6. Insert any qualifying `link rel="icon"` entries into `icons`
7. Update `hosts` row with `html_title`, `iframe_allowed`, `parsed = TRUE`

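Steps 2 and 4 above can be sketched as follows. This is an illustration in Python rather than the Go program the stage describes. The standard-library `HTMLParser` stands in for whichever lenient parser is actually used, the class and function names are hypothetical, and the png/gif/ico and 16-64px filters are left out for brevity.

```python
from html.parser import HTMLParser

def range_header(offset, length):
    """HTTP Range value for one WARC record (both ends are inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"

class HeadExtractor(HTMLParser):
    """Collects the first <title> text and any <link rel="icon"> hrefs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = None
        self.icon_hrefs = []
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" and self.title is None:
            self._in_title = True
        elif tag == "link":
            # rel may hold several tokens, e.g. "shortcut icon".
            rel = (attrs.get("rel") or "").lower().split()
            href = attrs.get("href")
            if "icon" in rel and href:
                self.icon_hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

def extract_head(html):
    """Return (title or None, list of icon hrefs) from possibly malformed HTML."""
    parser = HeadExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, parser.icon_hrefs

sample = (
    '<html><head><title> Example Domain </title>'
    '<link rel="shortcut icon" href="/fav.png">'
    '<link rel="stylesheet" href="/s.css"></head><body><p>unclosed'
)
title, icons = extract_head(sample)
```

A document with no closing `</title>` yields a `None` title here, which lines up with the "mark as parsed with NULL title" rule in the error handling for this stage.
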
**Concurrency:** High: thousands of goroutines. S3 byte-range requests are the bottleneck; S3 handles 5,500+ GET/s per prefix and WARC files are spread across many prefixes.

**Error handling:** If HTML is unparseable, mark as parsed with NULL title. If WARC fetch fails, retry once then skip. Log all errors with hostname for investigation.

**Stats emitted:** Rows processed, titles extracted, icons found (by type), iframe restrictions found, parse failures.

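The framing check that feeds `iframe_allowed` could look roughly like the sketch below. It is deliberately conservative, which is an assumption of this example and not something the design spells out: any `X-Frame-Options` value, and any `frame-ancestors` source list other than a bare `*`, is treated as a restriction.

```python
def iframe_allowed(headers):
    """Return False if the response headers restrict framing by other sites.

    headers: dict of header name -> value (names compared case-insensitively).
    """
    lowered = {name.lower(): value for name, value in headers.items()}

    # DENY and SAMEORIGIN both block a third-party frame, so any value counts.
    if lowered.get("x-frame-options", "").strip():
        return False

    # Within CSP, only the frame-ancestors directive controls who may embed the page.
    csp = lowered.get("content-security-policy", "")
    for directive in csp.split(";"):
        parts = directive.strip().split()
        if parts and parts[0].lower() == "frame-ancestors":
            sources = [source.lower() for source in parts[1:]]
            # Only a bare wildcard is treated as "anyone may embed".
            return sources == ["*"]
    return True
```

Erring toward `False` costs little: a site misclassified as restricted just opens in a new tab instead of an iframe overlay.
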
### Stage 3: DNS Resolution Setup

**Tool:** Unbound, installed and configured on EC2

**Configuration:**

- Recursive resolver (no forwarding to upstream)
- Listening on 127.0.0.1:53
- Aggressive caching enabled
- High min-TTL (e.g., 3600s) to maximize cache hits across similar domains
- Configured as system resolver in `/etc/resolv.conf`

This runs as a background service. There is no separate "DNS resolution stage": the Go icon downloader's HTTP requests transparently use Unbound via the OS resolver. Unbound handles recursive resolution and caching.

**Why:** Downloading 30M+ icons without a local recursive resolver would overwhelm upstream DNS providers and likely get us rate-limited. Unbound resolves from root servers directly, caches aggressively, and handles the load locally.

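Because the downloader relies on the OS resolver silently pointing at Unbound, a preflight check before Stage 4 may be worthwhile. The sketch below is a hypothetical helper, not part of the described pipeline. It parses `resolv.conf`-style text and confirms that the first nameserver is the local listener.

```python
def nameservers(resolv_conf_text):
    """Return nameserver addresses listed in resolv.conf-style text, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers

def uses_local_resolver(resolv_conf_text):
    """True only if the first (preferred) nameserver is 127.0.0.1."""
    servers = nameservers(resolv_conf_text)
    return bool(servers) and servers[0] == "127.0.0.1"

good = "# managed by hand\nnameserver 127.0.0.1\noptions edns0\n"
bad = "nameserver 10.0.0.2\nnameserver 127.0.0.1\n"
```

On the instance itself the input would come from reading `/etc/resolv.conf`. Note that some distributions regenerate that file (for example through DHCP hooks), so a check like this catches a silently reverted resolver.
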
### Stage 4: Icon Download

**Tool:** Custom Go program, highly concurrent

**Input:** `icons` table rows where `scan_state = 'unscanned'`

**Process:**

1. Claim a batch of rows atomically. Postgres has no `UPDATE ... LIMIT`, so the limit goes in a subquery: `UPDATE icons SET scan_state = 'in_progress' WHERE id IN (SELECT id FROM icons WHERE scan_state = 'unscanned' LIMIT N FOR UPDATE SKIP LOCKED) RETURNING *`
2. For each icon URL:
   - Make HTTP(S) GET request (normal Go HTTP client, DNS goes through Unbound)
   - Enforce timeout (5s connect, 10s total)
   - Enforce max download size (512KB, generous for icons)
   - On success: validate it's an image (check magic bytes), decode to get dimensions
   - Upload raw bytes to S3 `everytab-icons/{hash}` (content-addressed)
   - Update `icons` row: s3_key, content_type, width, height, scan_state = 'completed'
   - On failure: scan_state = 'failed', error = reason

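The validation and content-addressing steps can be sketched as below. SHA-256 is an assumption, since the design only says `{hash}`, and the function names are hypothetical. Decoding the image to obtain its dimensions is left out.

```python
import hashlib

MAX_ICON_BYTES = 512 * 1024  # the 512KB cap from the process above

# Leading bytes of the formats the pipeline accepts.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x00\x00\x01\x00", "image/x-icon"),
]

def sniff_content_type(data):
    """Return the MIME type implied by the magic bytes, or None if unrecognized."""
    for prefix, mime in MAGIC:
        if data.startswith(prefix):
            return mime
    return None

def icon_s3_key(data):
    """Content-addressed key: identical bytes always map to the same key."""
    return hashlib.sha256(data).hexdigest()

def classify_download(data):
    """Return (scan_state, content_type, s3_key, error) for a downloaded body."""
    if len(data) > MAX_ICON_BYTES:
        return ("failed", None, None, "too large")
    mime = sniff_content_type(data)
    if mime is None:
        return ("failed", None, None, "invalid image")
    return ("completed", mime, icon_s3_key(data), None)
```

Checking magic bytes instead of trusting the `Content-Type` header matters here because many servers answer `/favicon.ico` with an HTML error page and a 200 status.
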
**Concurrency:** Maximize throughput with a goroutine pool of configurable size (start at 1000, tune based on memory/bandwidth). Use semaphore pattern for backpressure.

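The semaphore pattern mentioned above can be illustrated with threads standing in for goroutines. The key point is that the slot is acquired before a worker is started, so the producer loop blocks when the pool is full and does not queue unbounded work. `run_bounded` and `fake_download` are hypothetical names for this sketch.

```python
import threading
import time

def run_bounded(items, worker, limit):
    """Call worker(item) for each item on its own thread, at most `limit` at a time."""
    semaphore = threading.Semaphore(limit)
    results = [None] * len(items)
    threads = []

    def run(index, item):
        try:
            results[index] = worker(item)
        finally:
            semaphore.release()  # free the slot even if the worker raised

    for index, item in enumerate(items):
        semaphore.acquire()  # blocks while `limit` workers are in flight
        thread = threading.Thread(target=run, args=(index, item))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return results

# Demo with a stand-in worker; a real worker would perform the HTTP GET.
state = {"active": 0, "peak": 0}
state_lock = threading.Lock()

def fake_download(n):
    with state_lock:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.01)
    with state_lock:
        state["active"] -= 1
    return n * 2

results = run_bounded(list(range(20)), fake_download, limit=4)
```

In Go the same shape is usually a buffered channel of size `limit` used as the semaphore, with a send before each `go` statement and a receive when the worker finishes.
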
**Fast failure:** DNS errors, connection refused, and timeouts are all terminal: the icon is marked failed on the first error with no retry (if it's down, it's down). This keeps the long tail short.

**Scaling to fleet:** If a single instance is insufficient:

- Multiple EC2 instances run the same binary
- Each claims work via the `scan_state` UPDATE (Postgres row-level locking prevents double-work)
- No coordination needed beyond the shared database

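A minimal sketch of the claim query that makes this coordination-free, written against a generic DB-API connection. psycopg2-style `%s` placeholders are an assumption, and `claim_batch` is a hypothetical name. `FOR UPDATE SKIP LOCKED` lets concurrent workers pass over rows another transaction is already claiming, so they do not block on them.

```python
CLAIM_SQL = """
UPDATE icons
SET scan_state = 'in_progress'
WHERE id IN (
    SELECT id FROM icons
    WHERE scan_state = 'unscanned'
    ORDER BY id
    LIMIT %s
    FOR UPDATE SKIP LOCKED
)
RETURNING id, url
"""

def claim_batch(conn, batch_size):
    """Atomically claim up to batch_size unscanned icons; returns [(id, url), ...].

    conn is a DB-API connection. Committing releases the row locks, by which
    point the rows are already marked 'in_progress' and invisible to other claimers.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (batch_size,))
        rows = cur.fetchall()
    finally:
        cur.close()
    conn.commit()
    return rows
```

One consequence of this design is that a worker that crashes mid-batch leaves rows stuck in `in_progress`. Resetting those to `unscanned` before a restart would be needed for the stage to stay resumable.
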
**Stats emitted:** Icons attempted, completed, failed (by error type: DNS, timeout, HTTP error, invalid image, too large), download rate (icons/sec), bytes downloaded.

### Stage 5: Best Icon Selection

**Tool:** SQL query or small script

**Process:**

For each host, select the best icon from its completed icons:

1. Filter to standard sizes: 16x16, 32x32, 48x48, 64x64
2. Among those, pick the largest dimensions (prefer 64 > 48 > 32 > 16)
3. If no standard sizes found, pick the largest icon with dimensions <= 64px on both axes
4. If no icons at all, host gets a NULL best_icon_id (will use default in frontend)

```sql
UPDATE hosts h SET best_icon_id = (
  SELECT i.id FROM icons i
  WHERE i.host_id = h.id
    AND i.scan_state = 'completed'
    AND i.width <= 64 AND i.height <= 64   -- rule 3: never pick anything larger than 64px
  ORDER BY
    -- rules 1-2: square standard sizes sort first (a 16x64 icon does not count)
    (i.width = i.height AND i.width IN (16, 32, 48, 64)) DESC,
    -- then the largest area wins
    i.width * i.height DESC
  LIMIT 1
);
```

**Stats emitted:** Hosts with icons, hosts without icons, icon size distribution.

### Stage 6: Bundle Generation

**Tool:** Custom Go program

**Input:** `hosts` table (joined with their best icon from S3)

**Process:**

1. Query all hosts where best_icon_id IS NOT NULL (or include no-icon hosts with a default flag)
2. Randomize the full result set (ORDER BY random() or shuffle in memory)
3. For each host:
   - Download its best icon from S3 `everytab-icons`
   - Decode the icon (ICO/GIF/PNG/etc.)
   - For ICO files: extract the largest embedded image at a standard size <= 64x64
   - Re-encode as PNG (optimized compression)
   - Base64-encode the PNG bytes
4. Chunk into groups of N entries (~500-700, tuned so each JSON is ~1MB)
5. Write each chunk as `tabs/{n}.json` to S3 `everytab-site`
6. Record total bundle count

**Output:**

- `tabs/0000.json` through `tabs/{M-1}.json` in S3, where M is the total bundle count
- Total bundle count M (used in frontend build as the exclusive upper bound for bundle indexes)

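Steps 4 and 5 can be sketched as a greedy chunker that closes a bundle once adding the next entry would exceed the size target. This chunks by serialized size instead of a fixed N, which is one possible reading of "tuned so each JSON is ~1MB". The function names and the zero-padding width are assumptions based on the `tabs/0000.json` example.

```python
import json

def chunk_entries(entries, target_bytes=1_000_000):
    """Greedily group entries so each serialized bundle stays near target_bytes."""
    bundles, current, current_size = [], [], 0
    for entry in entries:
        # +1 accounts for the comma separating entries in the JSON array.
        size = len(json.dumps(entry, separators=(",", ":")).encode("utf-8")) + 1
        if current and current_size + size > target_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append(entry)
        current_size += size
    if current:
        bundles.append(current)
    return bundles

def bundle_key(index):
    """S3 key for bundle number `index`, zero-padded to at least four digits."""
    return f"tabs/{index:04d}.json"

entries = [{"host": "h", "icon": "x" * 100} for _ in range(10)]
bundles = chunk_entries(entries, target_bytes=500)
keys = [bundle_key(i) for i in range(len(bundles))]
```

With the ~50K bundles estimated earlier, indexes above 9999 need five digits, so the frontend has to format indexes with exactly the same rule as the generator or it will request keys that do not exist.
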
**Stats emitted:** Total bundles created, total hosts included, total hosts excluded (no icon), average bundle size, total S3 storage used.

### Stage 7: Frontend Build

**Tool:** Script/template that produces `index.html`

**Process:**

1. Inject `TOTAL_BUNDLES` constant into the JS (baked at build time)
2. Minify if desired
3. Upload `index.html` to S3 `everytab-site` root

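The injection step is small enough to sketch in full. The template below is a hypothetical stub (the real `index.html` also carries the inline CSS and rendering code), and `render_index` is an illustrative name.

```python
from string import Template

# Hypothetical stub template; $total_bundles is the only placeholder.
INDEX_TEMPLATE = Template(
    "<!doctype html>\n"
    "<script>\n"
    "const TOTAL_BUNDLES = $total_bundles;\n"
    "</script>\n"
)

def render_index(total_bundles):
    """Return index.html text with the bundle count baked in."""
    if not isinstance(total_bundles, int) or total_bundles <= 0:
        raise ValueError("total_bundles must be a positive integer")
    return INDEX_TEMPLATE.substitute(total_bundles=total_bundles)

html = render_index(50_000)
```

One caveat with `string.Template`: inline JS that uses template literals such as `${name}` would be read as placeholders, so those dollar signs would need escaping as `$$`, or a different placeholder syntax.
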
### Stage 8: CloudFront Invalidation

Invalidate `/*` on the CloudFront distribution so the new site is live.

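For reference, a sketch of the request structure behind that invalidation. The helper only builds the `InvalidationBatch` payload. The actual call is shown in a comment because it needs boto3, AWS credentials, and a real distribution ID, and the ID shown is a placeholder.

```python
import time

def invalidation_batch(paths=("/*",), caller_reference=None):
    """Build the InvalidationBatch structure for CloudFront's CreateInvalidation."""
    if caller_reference is None:
        # CallerReference must be unique per request; a timestamp is one simple choice.
        caller_reference = f"everytab-{int(time.time())}"
    return {
        "Paths": {"Quantity": len(paths), "Items": list(paths)},
        "CallerReference": caller_reference,
    }

batch = invalidation_batch(caller_reference="crawl-2026-05")

# With boto3 installed and credentials configured (not run here):
#   import boto3
#   boto3.client("cloudfront").create_invalidation(
#       DistributionId="EXXXXXXXXXXXXX",  # placeholder distribution ID
#       InvalidationBatch=batch,
#   )
```
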
### Stage 9: Backup & Teardown

**Process:**

1. Dump RDS database to local machine (homelab): `pg_dump` over SSH tunnel or direct
2. Sync S3 `everytab-icons` to homelab storage: `aws s3 sync`
3. Confirm backups are complete
4. Delete RDS instance
5. Delete S3 `everytab-icons` bucket
6. Terminate EC2 instance

## Frontend Architecture

### Single-File Design

One `index.html` containing inline CSS and JS. No external dependencies, no framework. Two HTTP requests per initial page load:

1. `GET /index.html` (HTML + CSS + JS, likely <50KB)
2. `GET /tabs/{random}.json` (~1MB, one bundle of ~500-700 tabs)

### Tab Rendering

- Tabs fill the viewport in rows, styled to mimic Firefox browser tabs (v1)
- Each row has a slight horizontal marquee animation (CSS) at varying speeds
- Tab density adapts to viewport width (responsive)
- Each tab shows: favicon (or blank for no-icon) + truncated title

### Interaction

- **Click tab (iframe_ok=true):** Opens an iframe overlay showing the actual site
- **Click tab (iframe_ok=false):** Opens site in a new tab (with external link indicator)
- **Close:** X button or click-away dismisses the iframe/overlay
- **Scroll down:** Triggers fetch of additional random bundles (infinite scroll)

### Randomization

- Seed: current UTC date (so everyone on the same day sees the same "shuffle", but it changes daily)
- Generate random bundle index in range [0, TOTAL_BUNDLES)
- Track fetched bundle IDs in a Set to avoid duplicates on scroll

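The logic of these three rules is sketched below in Python. The frontend itself is JavaScript, where `Math.random()` cannot be seeded, so a small seeded generator would have to be implemented there. `bundle_order` is a hypothetical name, and seeding from the ISO date string is an assumption.

```python
import random
from datetime import date, datetime, timezone
from itertools import islice

def bundle_order(day, total_bundles):
    """Yield every bundle index exactly once, in an order fixed by the date."""
    rng = random.Random(day.isoformat())  # same date string, same sequence
    fetched = set()
    while len(fetched) < total_bundles:
        index = rng.randrange(total_bundles)  # uniform over [0, total_bundles)
        if index in fetched:
            continue  # already shown this bundle; draw again
        fetched.add(index)
        yield index

today = datetime.now(timezone.utc).date()
first_ten = list(islice(bundle_order(today, 50_000), 10))
```

A side effect of date seeding worth keeping in mind: every visitor on a given day requests the same bundles in the same order, so the first few bundles of the day will be served almost entirely from the CloudFront cache.
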
### No-Icon Hosts

Hosts without a favicon are included in bundles with `"icon": null`. Frontend renders these Firefox-style: just the title text with no icon. This matches Firefox's behavior for tabs without favicons.

## Cost Estimate

### Scanning Phase (One-Time per Crawl)

| Item | Estimate |
|------|----------|
| EC2 c5.xlarge (~24-48hrs) | $8-16 |
| RDS db.t3.medium (~48hrs) | $3-5 |
| S3 icons storage (temporary, ~500GB) | $12 (prorated to days) |
| S3 GET requests (30M WARC reads) | $12 |
| Data transfer (icon downloads, ~500GB inbound) | $0 (inbound is free) |
| **Total** | **~$35-45** |

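The totals can be re-derived from the line items. In the sketch below the ranges are copied from the table, while the S3 GET unit price is an assumed figure for S3 Standard that should be checked against current us-east-1 pricing.

```python
# Assumed unit price (USD per 1,000 GET requests, S3 Standard); verify before relying on it.
S3_GET_PRICE_PER_1000 = 0.0004

warc_reads = 30_000_000
s3_get_cost = warc_reads / 1000 * S3_GET_PRICE_PER_1000  # roughly $12

# (low, high) in USD, copied from the table above.
line_items = {
    "ec2": (8, 16),
    "rds": (3, 5),
    "s3_icons_storage": (12, 12),
    "s3_get_requests": (12, 12),
    "data_transfer_in": (0, 0),
}
total_low = sum(low for low, _ in line_items.values())
total_high = sum(high for _, high in line_items.values())
```

The sums come to $35 and $45, matching the table's total. The storage line is an upper bound, since $12 is roughly a full month of 500GB and the bucket only lives for a few days.
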
### Hosting Phase (Monthly Steady-State)

| Item | Estimate |
|------|----------|
| S3 storage (~60GB bundles) | $1.40 |
| CloudFront (free tier: 1TB/month, 10M requests) | $0* |
| S3 requests (via CloudFront origin pulls, cached) | ~$1-5 |
| **Total** | **~$3-10/month** |

*CloudFront free tier covers moderate traffic. Costs increase if the site goes viral, but that's a good problem to have.

## Scaling Strategy

### Development (100K domains)

- Single EC2 instance
- All stages complete in minutes-to-hours
- Good for validating the full pipeline end-to-end

### Full Scan (30M domains)

- Single EC2 instance, high concurrency
- CC-Index query: <1hr
- WARC parsing: 2-6hrs (limited by S3 request rate)
- Icon download: 12-48hrs (limited by network + remote server response times)
- Bundle generation: 1-2hrs

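These time budgets imply sustained request rates that are worth sanity-checking. The sketch assumes exactly one request per host, which understates icon downloads because hosts with `link rel="icon"` entries have more than one icon row.

```python
def required_rate(items, hours):
    """Average items per second needed to finish `items` within `hours`."""
    return items / (hours * 3600)

HOSTS = 30_000_000

warc_rate_fast = required_rate(HOSTS, 2)   # about 4,167 requests/s
warc_rate_slow = required_rate(HOSTS, 6)   # about 1,389 requests/s
icon_rate_fast = required_rate(HOSTS, 12)  # about 694 downloads/s
icon_rate_slow = required_rate(HOSTS, 48)  # about 174 downloads/s

# With 1,000 downloads in flight, the average time per download that each
# budget tolerates is concurrency / rate.
latency_fast = 1000 / icon_rate_fast  # about 1.4 s
latency_slow = 1000 / icon_rate_slow  # about 5.8 s
```

The WARC rates sit below the 5,500 GET/s per-prefix figure cited in Stage 2. For icons, the 12-hour budget only holds if the average download (including failures that run into the 10s timeout) completes in under about 1.4 seconds at a pool size of 1,000.
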
### Fleet Scaling (if needed)

- Spin up N identical EC2 instances running the icon downloader
- All share the same RDS instance
- Work claiming via Postgres atomic UPDATEs (no coordinator needed)
- Near-linear scaling: 4 instances = ~4x throughput, as long as the shared RDS instance is not the bottleneck

## Key Design Decisions

1. **Static-only hosting**: No servers running for the live site. Entire frontend is pre-built.
2. **Inline icons in bundles**: No per-icon requests. One bundle fetch gives you ~600 tabs to render.
3. **Unbound as system resolver**: Transparent to application code. Go HTTP client works normally; DNS just happens to resolve locally.
4. **Content-addressed icon storage**: S3 key is the content hash. Natural dedup at storage layer during scanning (but icons are duplicated across bundles for simplicity).
5. **Resumable pipeline**: Each stage uses database state (parsed, scan_state) to track progress. Crash and restart without re-doing completed work.
6. **PNG as universal icon format**: All icons converted to PNG for bundles regardless of source format. PNG is lossless, compact for small raster images, and universally supported in browsers via data URIs.
7. **Date-seeded randomization**: Everyone visiting on the same day sees the same tab arrangement, creating a shared experience. Changes daily for freshness.