# EveryTab Architecture
## System Overview
EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:
1. **Scanning Phase** — A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
2. **Hosting Phase** — A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.
The scanning phase runs monthly, triggered by each new Common Crawl release. It produces the static site, backs up its data to the homelab, and then has its infrastructure torn down. The hosting phase runs indefinitely at minimal cost.
## Workflow Diagram
```mermaid
flowchart TD
subgraph EC2["Scanning Phase (EC2 instance)"]
A["Stage 1: Query CC-Index via DuckDB"]
B["Stage 2: Parse WARCs - Go"]
C["Stage 3: Download Icons - Go"]
D["Stage 4: Select Best Icons"]
E["Stage 5: Generate Bundles - Go"]
F["Stage 6: Build Frontend"]
UB["Unbound - Local recursive resolver"]
A --> B --> C --> D --> E --> F
UB -.-> C
end
subgraph ExtData["External Data"]
CC["Common Crawl S3 - Parquet Index + WARCs"]
end
subgraph AWS["AWS Services"]
RDS[("RDS Postgres - hosts + icons tables")]
S3I["S3: everytab-icons - Raw downloaded favicons"]
S3S["S3: everytab-site - tabs/*.json + index.html"]
CF["CloudFront CDN"]
end
subgraph Post["Post-Scan"]
BAK["Backup to Homelab - RDS dump + icons sync"]
TEAR["Teardown - Delete RDS, icons bucket, EC2"]
end
CC --> A
CC --> B
A --> RDS
B --> RDS
C --> S3I
C --> RDS
D --> RDS
E --> S3S
F --> S3S
S3S --> CF
F --> BAK
BAK --> TEAR
```
**Key point:** DuckDB, Go programs, and Unbound all run on the same EC2 instance. The pipeline is sequential — one stage completes before the next begins.
## AWS Infrastructure
All resources in **us-east-1**.
| Resource | Purpose | Lifecycle |
|----------|---------|-----------|
| EC2 (c5.xlarge) | Run all pipeline stages | Scanning only |
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup to homelab, then delete) |
| S3 `everytab-icons` | Raw downloaded favicons | Scanning only (backup to homelab, then delete) |
| S3 `everytab-site` | Static site: index.html, site.js, tabs/*.json | Permanent |
| CloudFront | CDN for static site (Brotli compression enabled) | Permanent |
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only (runs on EC2) |
### Why Two S3 Buckets
- `everytab-site` is the CloudFront origin. The bucket itself stays private: CloudFront reads it through Origin Access Control (OAC), so the content is public only via the distribution. The entire bucket IS the website.
- `everytab-icons` is completely private — only the EC2 instance reads/writes to it. No public access configuration needed.
- Backup is clean: `aws s3 sync s3://everytab-icons/ /homelab/path/` grabs the whole bucket.
- Deletion is clean: `aws s3 rb s3://everytab-icons --force` — zero risk of nuking the live site.
- One bucket with prefix-based policies works but is fiddlier (CloudFront must serve `tabs/` and `index.html` but NOT `icons/`). Two buckets eliminate that surface area for misconfiguration.
### Steady-State (Hosting Only)
- S3 `everytab-site` — index.html + site.js + ~50K JSON bundles
- CloudFront distribution — Brotli-compressed delivery, caching
## Data Model
### `hosts` table
| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PRIMARY KEY | Internal ID |
| hostname | TEXT NOT NULL UNIQUE | e.g., `example.com` |
| protocol | TEXT NOT NULL | `https` or `http` (prefer https) |
| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., `CC-MAIN-2026-05`) |
| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
| warc_record_length | INT NOT NULL | Length of WARC record |
| html_title | TEXT | Extracted from `<title>` tag |
| iframe_allowed | BOOLEAN | True if site allows framing |
| best_icon_s3_key | TEXT | S3 key of the chosen icon (denormalized for fast bundle gen) |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |
### `icons` table
| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PRIMARY KEY | Internal ID |
| host_id | INT REFERENCES hosts(id) | FK to parent host |
| url | TEXT NOT NULL | Full URL to the icon |
| source | TEXT NOT NULL | `favicon_ico` or `link_rel` |
| rel_type | TEXT | MIME type from HTML attribute (if specified) |
| rel_sizes | TEXT | Sizes attribute from HTML (if specified) |
| content_type | TEXT | Actual MIME type after download |
| width | INT | Best usable pixel width (for ICO: largest standard size ≤64; for SVG: NULL) |
| height | INT | Best usable pixel height (for ICO: largest standard size ≤64; for SVG: NULL) |
| file_size | INT | Size in bytes |
| s3_key | TEXT | Key in everytab-icons bucket (SHA-256 of content) |
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
| error | TEXT | Error message if failed |
**Indexes:**
- `CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned'`: partial index for work claiming. It only indexes unscanned rows, so it shrinks as work completes. Write overhead is small: the index is touched when a row is inserted and once more when the row leaves 'unscanned'.
- `idx_icons_host_id` on (host_id) — for best-icon selection query
**S3 Key Strategy:** SHA-256 hash of the downloaded icon content. This gives free dedup at the storage layer — if two sites serve the exact same favicon bytes, we store it once. The hash is computed client-side (by the Go downloader) and used as the key. Before uploading, check if the key exists; if so, skip the upload but still record the s3_key in the icons table.
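A minimal Go sketch of the key derivation described above. The helper name `iconKey` and the lowercase hex encoding are assumptions for illustration; the document only specifies that the key is the SHA-256 of the content.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// iconKey returns a content-addressed S3 key for an icon: the lowercase hex
// SHA-256 digest of its raw bytes. Identical bytes always yield the same key,
// which is what makes storage-level dedup work.
// (Hypothetical helper; hex encoding is an assumption.)
func iconKey(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}
```

Because the key is derived from the content, an accidental second upload of the same key overwrites the object with identical bytes. The existence check is therefore a cost optimization rather than a correctness requirement.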
### Bundle JSON format (`tabs/{n}.json`)
```json
{
"entries": [
{
"host": "example.com",
"title": "Example Domain",
"icon": "iVBORw0KGgo...",
"icon_w": 32,
"icon_h": 32,
"iframe_ok": true
},
{
"host": "no-favicon-site.org",
"title": "A Site Without Favicon",
"icon": "",
"iframe_ok": false
}
]
}
```
Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with `"icon": ""`) as long as they have a title. CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data.
Bundle size is parameterized (`ENTRIES_PER_BUNDLE`). Target: enough entries to fill a viewport plus scroll buffer. Initial estimate ~100-150 entries (~150-300KB uncompressed, smaller after Brotli). Will be tuned empirically once the frontend is built and we can measure how many tabs fill a screen.
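The bundle format could be modeled in Go roughly as follows. This is a sketch: the type and function names are hypothetical, and omitting `icon_w`/`icon_h` for no-icon hosts via `omitempty` is an assumption inferred from the example above, where the second entry carries neither field.

```go
package main

import "encoding/json"

// Entry mirrors one element of "entries" in tabs/{n}.json.
type Entry struct {
	Host     string `json:"host"`
	Title    string `json:"title"`
	Icon     string `json:"icon"`             // base64-encoded PNG, "" when the host has no icon
	IconW    int    `json:"icon_w,omitempty"` // dropped from the JSON when zero (no icon)
	IconH    int    `json:"icon_h,omitempty"` // dropped from the JSON when zero (no icon)
	IframeOK bool   `json:"iframe_ok"`
}

// Bundle is the top-level object of one bundle file.
type Bundle struct {
	Entries []Entry `json:"entries"`
}

// encodeBundle serializes one chunk of entries as compact JSON.
func encodeBundle(entries []Entry) ([]byte, error) {
	return json.Marshal(Bundle{Entries: entries})
}
```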
## Pipeline Stages
The pipeline is a series of manually-run scripts executed in order on the single EC2 instance. Each stage is idempotent and resumable.
### Stage 1: CC-Index Query
**Tool:** DuckDB with httpfs extension (query CC parquet directly from S3; if the query takes longer than 1 hour, fall back to downloading the parquet files locally first)
**Input:** Common Crawl columnar index (parquet files on `s3://commoncrawl/cc-index/...`)
**Query logic:**
```sql
WHERE url_path = '/'
AND content_mime_type = 'text/html'
AND fetch_status = 200
AND url_query IS NULL
AND url_protocol IN ('http', 'https')
AND url_port IS NULL
```
**Deduplication:** Per hostname, prefer `https` over `http`. Result is one row per unique hostname.
**Output:** Populates `hosts` table in RDS (~30M rows for a full crawl).
**Cost:** $0 — Common Crawl is part of the AWS Open Data Registry. S3 GET requests and data transfer within us-east-1 are free.
**Stats emitted:** Total domains found, https vs http breakdown, duplicates removed.
### Stage 2: WARC Parsing
**Tool:** Custom Go program, highly concurrent
**Input:** `hosts` table rows where `parsed = FALSE`
**Process:**
1. Read batches of unparsed rows (cursor-based pagination by ID)
2. For each row, make a byte-range GET request to Common Crawl's S3:
- `Range: bytes={offset}-{offset+length-1}`
- Target: `https://data.commoncrawl.org/{warc_filename}`
3. Parse the WARC record to extract the HTTP response
4. From HTTP response headers: check for `X-Frame-Options` and `Content-Security-Policy` frame-ancestors
5. Parse HTML defensively (lenient parser, handle malformed HTML):
- Extract `<title>` tag content
- Extract ALL `<link rel="icon">` / `<link rel="shortcut icon">` entries with their href, type, and sizes attributes
6. Insert a `/favicon.ico` entry into `icons` for every host (protocol://hostname/favicon.ico)
7. Insert all discovered `link rel="icon"` entries into `icons` (any format: ICO, PNG, GIF, SVG, WebP, JPEG)
8. Update `hosts` row: html_title, iframe_allowed, parsed = TRUE
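Step 2 above can be sketched as a small request builder. The function name `warcRangeRequest` is hypothetical; the URL prefix and the range arithmetic come from the step itself.

```go
package main

import (
	"fmt"
	"net/http"
)

// warcRangeRequest builds the GET request for a single WARC record.
// HTTP byte ranges are inclusive on both ends, so the last byte of the
// record is offset+length-1.
func warcRangeRequest(warcFilename string, offset, length int64) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, "https://data.commoncrawl.org/"+warcFilename, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+length-1))
	return req, nil
}
```

A caller would likely want to treat anything other than `206 Partial Content` as a fetch failure, since a `200` response would mean the server ignored the range and is sending the whole file.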
**Concurrency:** High — thousands of goroutines with a semaphore/pool. CC's S3 handles massive throughput.
**Error handling:** Malformed HTML → still extract what we can (partial title, partial icons). WARC fetch failure → log and skip (mark parsed = TRUE with NULL title to avoid retry loops). All errors logged with hostname for investigation.
**Icon URL handling:** Relative URLs resolved against `{protocol}://{hostname}/`. Absolute URLs kept as-is. Data URIs ignored.
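The resolution rule can be expressed with the standard `net/url` package. In this sketch the name `resolveIconURL` is hypothetical, and rejecting schemes other than `http`/`https` is an added assumption that goes slightly beyond the rule stated above.

```go
package main

import (
	"net/url"
	"strings"
)

// resolveIconURL turns an href from a <link> tag into an absolute URL.
// Relative references resolve against {protocol}://{hostname}/, absolute
// URLs pass through unchanged, and data URIs are ignored (ok == false).
func resolveIconURL(protocol, hostname, href string) (string, bool) {
	href = strings.TrimSpace(href)
	if href == "" || strings.HasPrefix(strings.ToLower(href), "data:") {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	base := &url.URL{Scheme: protocol, Host: hostname, Path: "/"}
	abs := base.ResolveReference(ref)
	// Assumption: anything that is not fetchable over HTTP(S) is skipped.
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false
	}
	return abs.String(), true
}
```

Protocol-relative references such as `//cdn.example.net/i.png` inherit the scheme of the host page, which is standard reference-resolution behavior.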
**No scan_state needed:** CC's S3 is highly reliable, so the `parsed` boolean is sufficient. If the process crashes mid-batch, a re-run picks up where it left off by selecting the rows still marked unparsed.
**Cost:** $0 (same Open Data program).
**Stats emitted:** Rows processed, titles extracted, icons found (by source: favicon_ico vs link_rel), icon format distribution, iframe restrictions found, parse failures, rows with no title.
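The header check in step 4 of the process might look like the sketch below. It is deliberately conservative and that policy is an assumption, not something the document specifies: any `X-Frame-Options` value is treated as blocking, and a `frame-ancestors` directive is treated as blocking unless it lists a bare `*`. A false negative only means the tab opens in a new browser tab instead of the overlay. The name `iframeAllowed` is hypothetical.

```go
package main

import (
	"net/http"
	"strings"
)

// iframeAllowed reports whether the response headers appear to permit
// framing by an unrelated third-party page.
func iframeAllowed(h http.Header) bool {
	// DENY and SAMEORIGIN both block a third-party embedder; other values
	// are treated as blocking too, to stay on the safe side.
	if strings.TrimSpace(h.Get("X-Frame-Options")) != "" {
		return false
	}
	for _, policy := range h.Values("Content-Security-Policy") {
		for _, directive := range strings.Split(policy, ";") {
			fields := strings.Fields(strings.ToLower(directive))
			if len(fields) == 0 || fields[0] != "frame-ancestors" {
				continue
			}
			wildcard := false
			for _, src := range fields[1:] {
				if src == "*" {
					wildcard = true
				}
			}
			if !wildcard {
				return false
			}
		}
	}
	return true
}
```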
### Stage 3: Icon Download
**Tool:** Custom Go program, highly concurrent
**Prerequisite:** Unbound running as system resolver on the EC2 instance.
**Input:** `icons` table rows where `scan_state = 'unscanned'` and icon is worth downloading:
- All `favicon_ico` entries (always attempt)
- `link_rel` entries with no declared size (unknown, could be useful)
- `link_rel` entries with declared size ≤64x64
- Skip `link_rel` entries with declared size >64x64 (192x192, 180x180, 152x152, etc. — apple-touch-icon bloat we won't use at tab scale)
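The filter above could be implemented as a predicate over the `source` and `rel_sizes` columns. This is a sketch with hypothetical names; how to treat a `sizes` attribute that lists several sizes is not stated in the document, so the assumption here is that one declared size within the limit is enough to download.

```go
package main

import (
	"strconv"
	"strings"
)

const maxIconDim = 64 // largest declared dimension worth downloading

// worthDownloading applies the Stage 3 input filter to one icons row.
func worthDownloading(source, relSizes string) bool {
	if source == "favicon_ico" {
		return true // always attempt /favicon.ico
	}
	declared := false
	for _, tok := range strings.Fields(strings.ToLower(relSizes)) {
		w, h, ok := parseSize(tok)
		if !ok {
			continue // "any" or malformed tokens count as undeclared
		}
		declared = true
		if w <= maxIconDim && h <= maxIconDim {
			return true
		}
	}
	// No usable size declared: unknown, could be useful.
	return !declared
}

// parseSize parses a single "WxH" token such as "32x32".
func parseSize(tok string) (int, int, bool) {
	ws, hs, found := strings.Cut(tok, "x")
	if !found {
		return 0, 0, false
	}
	w, errW := strconv.Atoi(ws)
	h, errH := strconv.Atoi(hs)
	if errW != nil || errH != nil || w <= 0 || h <= 0 {
		return 0, 0, false
	}
	return w, h, true
}
```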
**Process:**
1. Claim batch (randomized to spread load across hosts):
```sql
UPDATE icons SET scan_state = 'in_progress'
WHERE id IN (
SELECT id FROM icons
WHERE scan_state = 'unscanned'
ORDER BY md5(id::text) -- deterministic shuffle: spreads hosts apart
LIMIT N
FOR UPDATE SKIP LOCKED
) RETURNING *;
```
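This keeps requests to the same domain from landing back-to-back. The hash order is a deterministic shuffle rather than a true random one, but with 30M+ icons spread across different hosts, a batch of 1000 almost never contains two icons from the same server. One caveat: the sort on `md5(id::text)` cannot use `idx_icons_unscanned`, so each claim sorts every remaining unscanned row unless the partial index is built on that expression instead.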
2. For each icon URL:
- Make HTTP(S) GET request (standard Go HTTP client — DNS transparently goes through Unbound)
- Enforce timeouts: 5s connect, 10s total
- Enforce max download size: 512KB (generous for icons, but prevents abuse)
- On success:
- Validate magic bytes (is this actually an image?)
- Decode to get dimensions:
- PNG/GIF/WebP/JPEG/BMP: read image headers for width/height
- ICO: parse ICO header, find largest embedded size ≤64x64 at a standard dimension (16/32/48/64), store THAT size in width/height
- SVG: store width=NULL, height=NULL (vector, no pixel size)
- Compute SHA-256 of content
- Upload to S3 `everytab-icons/{sha256}` (skip if key already exists — dedup)
- Update icons row: s3_key, content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
- On failure: scan_state = 'failed', error = reason
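For the ICO case, the size selection can be done from the file's directory alone, without decoding any image data. The sketch below assumes the standard ICO layout (a 6-byte header followed by 16-byte directory entries whose first two bytes are width and height) and additionally assumes that only square entries qualify, which the document does not state explicitly. The name `bestICOSize` is hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// bestICOSize returns the largest square embedded size at a standard
// dimension (16, 32, 48 or 64) listed in the ICO directory.
func bestICOSize(data []byte) (int, error) {
	// Header: reserved (must be 0), type (1 = icon), image count.
	if len(data) < 6 ||
		binary.LittleEndian.Uint16(data[0:2]) != 0 ||
		binary.LittleEndian.Uint16(data[2:4]) != 1 {
		return 0, errors.New("not an ICO file")
	}
	count := int(binary.LittleEndian.Uint16(data[4:6]))
	best := 0
	for i := 0; i < count; i++ {
		off := 6 + 16*i
		if off+16 > len(data) {
			return 0, errors.New("truncated ICO directory")
		}
		// One byte each; a stored 0 means 256 pixels.
		w, h := int(data[off]), int(data[off+1])
		if w != h {
			continue
		}
		switch w {
		case 16, 32, 48, 64:
			if w > best {
				best = w
			}
		}
	}
	if best == 0 {
		return 0, errors.New("no standard size <= 64 in ICO")
	}
	return best, nil
}
```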
**Concurrency:** Goroutine pool with configurable size (start 1000, tune based on system resources). Semaphore pattern for backpressure. Monitor memory usage.
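One common way to get a bounded pool with backpressure in Go is a buffered channel used as a counting semaphore. The sketch below illustrates that pattern under assumed names (`runBounded`, string job IDs); it is not necessarily how the downloader is structured.

```go
package main

import "sync"

// runBounded calls work(job) for every job, with at most limit calls
// running at the same time.
func runBounded(jobs []string, limit int, work func(string)) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, job := range jobs {
		// Blocks while limit workers are busy, so the producer cannot
		// race ahead of the workers (backpressure).
		sem <- struct{}{}
		wg.Add(1)
		go func(j string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			work(j)
		}(job)
	}
	wg.Wait()
}
```

Acquiring the slot before starting the goroutine means the number of live goroutines is also bounded by `limit`, which keeps memory use predictable when tuning the pool size.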
**Fast failure strategy:**
- DNS failure → fail immediately (Unbound will cache NXDOMAIN)
- Connection refused → fail immediately
- Timeout → fail after deadline (no retry)
- Too large → abort read at 512KB boundary
- Not an image → fail (record content-type in error)
**Permissive on format:** Download everything the server returns: ICO, PNG, GIF, SVG, WebP, JPEG, BMP, whatever it is. Store the raw bytes in S3. Format filtering and conversion happen later in bundle generation.
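The magic-byte validation could be a small signature table like the one below. It is a heuristic sketch with a hypothetical name (`sniffImage`): the two-byte BMP signature and the `<svg` substring check can both produce false positives, so a full decode is still needed before an icon is trusted.

```go
package main

import "bytes"

// sniffImage returns a MIME type based on the leading bytes of b, or ""
// when the content does not look like a supported image format.
func sniffImage(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")):
		return "image/png"
	case bytes.HasPrefix(b, []byte("GIF87a")), bytes.HasPrefix(b, []byte("GIF89a")):
		return "image/gif"
	case bytes.HasPrefix(b, []byte("\xff\xd8\xff")):
		return "image/jpeg"
	case bytes.HasPrefix(b, []byte{0, 0, 1, 0}):
		return "image/x-icon"
	case bytes.HasPrefix(b, []byte("BM")):
		return "image/bmp"
	case len(b) >= 12 && string(b[0:4]) == "RIFF" && string(b[8:12]) == "WEBP":
		return "image/webp"
	case bytes.Contains(bytes.ToLower(head(b, 1024)), []byte("<svg")):
		return "image/svg+xml"
	}
	return ""
}

// head returns at most the first n bytes of b.
func head(b []byte, n int) []byte {
	if len(b) > n {
		return b[:n]
	}
	return b
}
```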
**Scaling to fleet (if needed):**
- Multiple EC2 instances run the same binary
- Each claims work via Postgres row-level locking (`FOR UPDATE SKIP LOCKED`)
- No coordinator needed — linear scaling with instance count
**Stats emitted:** Icons attempted, completed, failed (breakdown by error type: DNS, timeout, connection refused, HTTP 4xx, HTTP 5xx, invalid image, too large), icons/sec rate, bytes downloaded, unique S3 keys (dedup hits).
### Stage 4: Best Icon Selection
**Tool:** SQL script
**Process:** For each host, select the best icon from its completed downloads:
```sql
UPDATE hosts h SET best_icon_s3_key = (
SELECT i.s3_key FROM icons i
WHERE i.host_id = h.id
AND i.scan_state = 'completed'
ORDER BY
-- Prefer standard square sizes
CASE
WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
WHEN i.width = i.height AND i.width <= 64 THEN 1
WHEN i.width <= 64 AND i.height <= 64 THEN 2
ELSE 3
END,
-- Among valid options, prefer larger; NULLS LAST keeps NULL-width SVGs from sorting ahead of rasters
i.width DESC NULLS LAST,
-- Prefer PNG/GIF/ICO over SVG/WebP for simpler processing
CASE
WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
WHEN i.content_type IN ('image/webp') THEN 1
WHEN i.content_type IN ('image/svg+xml') THEN 2
ELSE 3
END,
-- Smaller file size as tiebreaker
i.file_size ASC
LIMIT 1
);
```
**Note on SVG/WebP:** These are downloaded and stored during scanning but are lower priority for bundle selection. Rasterizing SVG to PNG adds complexity; WebP re-encoding to PNG may increase size. If a host ONLY has SVG/WebP icons, we still use them (convert in bundle generation). But if PNG/GIF/ICO alternatives exist, prefer those.
**Stats emitted:** Hosts with icons selected, hosts without any icon, icon size distribution, format distribution of selected icons.
### Stage 5: Bundle Generation
**Tool:** Custom Go program (multi-threaded for image processing)
**Input:** All hosts where `html_title IS NOT NULL` (include hosts without icons)
**Process:**
1. Query all qualifying hosts from RDS (with their best_icon_s3_key)
2. Randomize the full result set
3. For each host with an icon (best_icon_s3_key IS NOT NULL):
- Download from S3 `everytab-icons/{s3_key}`
- Decode the image based on format:
- ICO: parse container, extract the image at the size recorded in width/height (the largest standard size ≤64x64). ICO can embed BMP or PNG internally — decode whichever is present.
- PNG: decode directly
- GIF/WebP/BMP/JPEG: decode to raster
- SVG: rasterize to 32x32 (use a Go SVG rasterizer library)
- Re-encode as optimized PNG at original dimensions (never upscale — a 16x16 stays 16x16)
- Base64-encode the PNG bytes
4. For hosts without icons: set icon to empty string
5. Chunk into groups of `ENTRIES_PER_BUNDLE` entries (parameterized, initially ~100-150, tuned to viewport fill)
6. Serialize each chunk as JSON, write to S3 `everytab-site/tabs/{n}.json`
7. Record total bundle count
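Steps 5 and 6 amount to slicing the shuffled host list into fixed-size groups and naming each group by its index. A minimal sketch, with hypothetical helper names `chunk` and `bundleKey`:

```go
package main

import "fmt"

// chunk splits items into consecutive groups of at most size elements.
// The last group may be shorter.
func chunk[T any](items []T, size int) [][]T {
	if size <= 0 {
		return nil
	}
	var out [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

// bundleKey returns the object key for bundle n within everytab-site.
func bundleKey(n int) string {
	return fmt.Sprintf("tabs/%d.json", n)
}
```

With M chunks, the keys run from `bundleKey(0)` to `bundleKey(M-1)`, which matches a frontend that draws indices from `[0, TOTAL_BUNDLES)`.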
**Output:**
- `tabs/0.json` through `tabs/{M-1}.json` in S3 `everytab-site`
- Total bundle count M
- `stats.json` in S3 `everytab-site` (pipeline statistics)
**Stats emitted:** Total bundles created, total hosts included (with icon / without icon), average bundle size (bytes), total S3 storage used, icon conversion failures.
### Stage 6: Frontend Build
**Tool:** Simple script or template engine
**Process:**
1. Inject `const TOTAL_BUNDLES = {M};` into the JS
2. Write `index.html` and `site.js` to S3 `everytab-site`
3. Invalidate CloudFront distribution (`/*`)
### Stage 7: Backup & Teardown
**Process (manual, with confirmation at each step):**
1. Dump RDS database: `pg_dump` → transfer to homelab
2. Sync icons: `aws s3 sync s3://everytab-icons/ homelab:/path/to/backup/icons/`
3. **Verify backups:** confirm pg_dump restores cleanly on homelab, spot-check icon files
4. Delete RDS instance (skip final snapshot — homelab backup is the source of truth, snapshots cost $0.095/GB-month)
5. Delete S3 `everytab-icons` bucket
6. Terminate EC2 instance
## DNS Architecture
**Unbound** runs on the EC2 instance as the system DNS resolver.
**Configuration:**
- Recursive resolver mode (no forwarding to any upstream — resolves from root servers)
- Listening on 127.0.0.1:53
- Set as system resolver in `/etc/resolv.conf`
- Aggressive caching enabled
- High min-TTL (3600s) — maximizes cache hits for TLD/popular nameservers
- High cache size (allocate 1-2GB RAM to Unbound)
- Prefetch enabled (refresh popular entries before expiry)
**Why recursive instead of forwarding:** Forwarding to Google/Cloudflare would get us rate-limited at 30M+ lookups. Recursive resolution distributes load across thousands of authoritative nameservers. With caching, the actual external query volume is much lower than 30M (most domains share TLD nameservers, many share CDN nameservers).
**Transparent to Go:** The Go HTTP client uses the OS resolver, which uses Unbound. No custom transport, no SNI issues, no pre-resolved IPs needed. Standard HTTPS connections with normal hostname verification.
## Frontend Architecture
### File Structure
- `index.html` — minimal HTML shell, inline CSS
- `site.js` — tab rendering logic, bundle fetching, interaction (separate file for cleanliness, cached after first load)
### Requests Per Visit
1. `GET /index.html` — HTML + CSS (<10KB)
2. `GET /site.js` — JavaScript (cached indefinitely via content hash in filename or cache headers)
3. `GET /tabs/{random}.json` — first bundle (~150-300KB, Brotli-compressed to ~100-200KB)
Subsequent scrolls: one additional `/tabs/{n}.json` per "page" of tabs.
### Tab Rendering
- Rows of tabs fill the viewport, styled to mimic Firefox browser tabs (v1)
- Each row has a subtle horizontal marquee animation (CSS `@keyframes` / `animation`) at slightly varying speeds
- Tab density adapts to viewport width (responsive)
- Each tab shows: favicon (rendered via `<img src="data:image/png;base64,...">`) + truncated title
- No-icon tabs: just title text, no icon (Firefox behavior)
- Enough tabs rendered to fill viewport + buffer below fold (so user can scroll immediately without waiting for next fetch)
### Interaction
- **Click tab (iframe_ok=true):** Opens an iframe overlay showing the actual site
- **Click tab (iframe_ok=false):** Opens site in a new tab (with subtle external-link indicator on the tab)
- **Close overlay:** X button or click outside dismisses iframe
- **Scroll down:** When approaching the bottom, fetch next random bundle and render more rows
### Randomization
- Seed: `Date.now()` (milliseconds UTC) — every visitor at a different moment sees different tabs
- PRNG: seeded random number generator (e.g., mulberry32 or xoshiro) for deterministic sequence from seed
- Generate random bundle indices in range `[0, TOTAL_BUNDLES)`
- Track fetched bundle IDs in a `Set` to avoid loading duplicates on continued scroll
### Future Enhancements (v2+)
- Browser-specific tab styles (Chrome tabs for Chrome users, Safari for Safari, etc.)
- Mobile-optimized layout
- "Search for a site" feature
- Stats page (how many sites, coverage, etc.)
## Statistics & Metadata
Each pipeline stage emits a JSON stats file:
```
stats/
01_cc_index.json
02_warc_parse.json
03_icon_download.json
04_best_icon.json
05_bundle_gen.json
```
After bundle generation, these are merged into a single `stats.json` uploaded to `everytab-site`:
```json
{
"crawl_id": "CC-MAIN-2026-05",
"generated_at": "2026-05-18T20:10:00Z",
"pipeline": {
"cc_index": {
"started_at": "2026-05-17T08:00:00Z",
"finished_at": "2026-05-17T08:42:00Z",
"duration_seconds": 2520,
"total_domains": 31245678,
"https": 28901234,
"http_only": 2344444,
"duplicates_removed": 1456789
},
"warc_parse": {
"started_at": "2026-05-17T08:45:00Z",
"finished_at": "2026-05-17T12:15:00Z",
"duration_seconds": 12600,
"processed": 31245678,
"titles_extracted": 29876543,
"icons_found": 45678901,
"iframe_restricted": 12345678,
"parse_failures": 234567
},
"icon_download": {
"started_at": "2026-05-17T12:20:00Z",
"finished_at": "2026-05-18T18:30:00Z",
"duration_seconds": 108600,
"attempted": 45678901,
"completed": 38901234,
"failed_dns": 2345678,
"failed_timeout": 1234567,
"failed_http_error": 1567890,
"failed_invalid_image": 890123,
"failed_too_large": 12345,
"unique_icons_stored": 34567890,
"dedup_hits": 4333344
},
"best_icon": {
"started_at": "2026-05-18T18:35:00Z",
"finished_at": "2026-05-18T18:40:00Z",
"duration_seconds": 300,
"hosts_with_icon": 27654321,
"hosts_without_icon": 3591357
},
"bundles": {
"started_at": "2026-05-18T18:45:00Z",
"finished_at": "2026-05-18T20:10:00Z",
"duration_seconds": 5100,
"total_bundles": 52341,
"total_hosts_included": 29876543,
"hosts_with_icon": 27654321,
"hosts_without_icon": 2222222,
"excluded_no_title": 1369135,
"avg_bundle_size_bytes": 245000
}
}
}
```
This is served publicly at `/stats.json` on the live site — interesting metadata for visitors and useful for monitoring pipeline health across crawls.
## Cost Estimate
### Scanning Phase (One-Time per Crawl)
| Item | Estimate |
|------|----------|
| EC2 c5.xlarge (~24-48hrs) | $8-16 |
| RDS db.t3.medium (~48-72hrs including dev time) | $3-7 |
| S3 everytab-icons storage (~500GB, prorated to days) | $1-3 |
| S3 PUT requests (icon uploads, ~30M at $0.005 per 1,000) | $150 |
| Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) |
| Data transfer (icon downloads from internet, inbound) | $0 (inbound free) |
| Data transfer (backup to homelab, outbound) | $5-10 |
| **Total** | **~$167-186** |
### Hosting Phase (Monthly Steady-State)
| Item | Estimate |
|------|----------|
| S3 everytab-site storage (~10-15GB of bundles) | $0.35 |
| CloudFront (free tier: 1TB/month transfer, 10M requests/month) | $0 |
| S3 origin requests via CloudFront (heavily cached) | $1-3 |
| **Total** | **~$2-4/month** |
Note: Bundle storage estimate revised down. With ~50K bundles at ~250KB each = ~12.5GB, well under previous estimate since we're targeting viewport-fill (100-150 tabs) not 1MB bundles.
If the site gets significant traffic beyond CloudFront free tier, costs scale with usage — but that's a success problem.
## Scaling Strategy
### Development Phase (100K domains)
- Cap CC-Index query to 100K rows
- Full pipeline runs in minutes
- Validates end-to-end correctness
- Frontend development and tab-density tuning
### Full Scan (30M domains)
- Single EC2 instance, high concurrency
- CC-Index query: <1hr (httpfs) or ~2hrs (download + local query)
- WARC parsing: 2-6hrs
- Icon download: 12-48hrs (the long pole)
- Bundle generation: 1-2hrs
- Total: ~1-2 days
### Fleet Scaling (if single instance is too slow)
- Spin up N identical EC2 instances running the icon downloader
- All connect to the same RDS instance
- Work claiming via `FOR UPDATE SKIP LOCKED` — no double work, no coordinator
- Linear throughput scaling: 4 instances ≈ 4x download speed
- Only the icon download stage benefits from fleet (other stages are fast enough solo)
## Key Design Decisions
1. **Static-only hosting** — No servers for the live site. Everything pre-built. Minimal attack surface, minimal cost.
2. **Inline icons in bundles** — One fetch gives you 100+ tabs to render. No per-icon requests.
3. **Base64 + Brotli** — Base64 for browser-native decoding (`atob()`). Brotli compression at the CDN layer reduces transfer size by ~25-30% for free.
4. **Unbound as system resolver** — Transparent to application code. Standard Go HTTP. No custom networking.
5. **SHA-256 content-addressed icon storage** — Natural dedup at S3 layer. Same favicon stored once even if referenced by multiple hosts.
6. **Permissive download, selective bundling** — Download ALL favicon formats during scanning. Convert to optimized PNG only during bundle generation. Decouples "capture as much as possible" from "serve the best version."
7. **Partial index for work claiming** — Indexes only unscanned rows. Shrinks as work progresses. Minimal write amplification.
8. **Two S3 buckets** — Clean separation of concerns. Private working storage vs public site. Safe deletion of temporary data.
9. **Per-millisecond random seed** — Every visitor sees a unique arrangement. No shared state, no server needed for randomization.
10. **Viewport-sized bundles** — ~100-150 tabs per bundle, tuned to fill a screen. Faster loads, smaller memory footprint than 1MB bundles.
11. **Include no-icon hosts** — Any host with a title is included. Firefox-style rendering (title only) for hosts without favicons.
12. **Denormalized best_icon_s3_key in hosts** — Avoids joins during bundle generation. Written once during icon selection, read once during bundling.