EveryTab Architecture
System Overview
EveryTab is a static website that displays a page full of browser tabs representing every website on the internet. The system has two phases:
- Scanning Phase — A data pipeline that extracts website metadata from Common Crawl, downloads favicons, and processes them into servable bundles.
- Hosting Phase — A static site served via S3 + CloudFront that renders tabs using pre-built JSON bundles.
The scanning phase runs monthly (triggered by new Common Crawl releases), produces a static site, and then its infrastructure is torn down after backing up data to the homelab. The hosting phase runs indefinitely at minimal cost.
Workflow Diagram
flowchart TD
subgraph EC2["Scanning Phase (EC2 instance)"]
A["Stage 1: Query CC-Index via DuckDB"]
B["Stage 2: Parse WARCs - Go"]
C["Stage 3: Download Icons - Go"]
D["Stage 4: Select Best Icons"]
E["Stage 5: Generate Bundles - Go"]
F["Stage 6: Deploy Frontend"]
UB["Unbound - Local recursive resolver"]
DISK["Local disk - Sharded icon archive"]
A --> B --> C --> D --> E --> F
UB -.-> C
C --> DISK
DISK --> E
end
subgraph ExtData["External Data"]
CC["Common Crawl S3 - Parquet Index + WARCs"]
end
subgraph AWS["AWS Services"]
RDS[("RDS Postgres - hosts + icons tables")]
S3S["S3: everytab-site - tabs/*.json + index.html"]
CF["CloudFront CDN"]
end
subgraph Post["Post-Scan"]
BAK["Backup to Homelab - RDS dump + icons rsync"]
TEAR["Teardown - Delete RDS, EC2"]
end
CC --> A
CC --> B
A --> RDS
B --> RDS
C --> RDS
D --> RDS
E --> S3S
F --> S3S
S3S --> CF
F --> BAK
BAK --> TEAR
Key point: DuckDB, Go programs, and Unbound all run on the same EC2 instance. The pipeline is sequential — one stage completes before the next begins.
AWS Infrastructure
All resources in us-east-1.
| Resource | Purpose | Lifecycle |
|---|---|---|
| EC2 (c5.2xlarge) + 1TB EBS | Run all pipeline stages, store icon archive | Scanning only |
| RDS Postgres (db.m5.large) | Store hosts/icons metadata | Scanning only (backup to homelab, then delete) |
| S3 everytab-site | Static site: index.html, site.js, tabs/*.json | Permanent |
| CloudFront | CDN for static site (Brotli compression enabled) | Permanent |
| S3 everytab-logs | CloudFront access logs | Permanent |
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only (runs on EC2) |
Icon Storage
Icons are stored on local disk during scanning, not S3. The EBS volume holds the full icon archive in a sharded directory structure (ab/cd/ef/{sha256}). This avoids ~$175 in S3 PUT costs at 30M scale. After scanning completes, icons are backed up to the homelab via rsync.
Steady-State (Hosting Only)
- S3 everytab-site: index.html + site.js + ~250K JSON bundles
- CloudFront distribution: Brotli-compressed delivery, caching
Data Model
hosts table
| Column | Type | Description |
|---|---|---|
| id | SERIAL PRIMARY KEY | Internal ID |
| hostname | TEXT NOT NULL UNIQUE | e.g., example.com |
| protocol | TEXT NOT NULL | https or http (prefer https) |
| crawl_id | TEXT NOT NULL | CC crawl identifier (e.g., CC-MAIN-2026-05) |
| warc_filename | TEXT NOT NULL | Path to WARC file in CC's S3 |
| warc_record_offset | BIGINT NOT NULL | Byte offset into WARC file |
| warc_record_length | INT NOT NULL | Length of WARC record |
| html_title | TEXT | Extracted from <title> tag |
| iframe_allowed | BOOLEAN | True if site allows framing |
| best_icon_hash | TEXT | SHA-256 hash of the chosen icon file (denormalized for fast bundle gen) |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |
| random_order | DOUBLE PRECISION DEFAULT random() | Random value for shuffled bundle generation pagination |
icons table
| Column | Type | Description |
|---|---|---|
| id | SERIAL PRIMARY KEY | Internal ID |
| host_id | INT REFERENCES hosts(id) | FK to parent host |
| url | TEXT NOT NULL | Full URL to the icon |
| source | TEXT NOT NULL | favicon_ico or link_rel |
| rel_type | TEXT | MIME type from HTML attribute (if specified) |
| rel_sizes | TEXT | Sizes attribute from HTML (if specified) |
| content_type | TEXT | Actual MIME type after download |
| width | INT | Best usable pixel width (for ICO: largest standard size ≤64; for SVG: NULL) |
| height | INT | Best usable pixel height (for ICO: largest standard size ≤64; for SVG: NULL) |
| file_size | INT | Size in bytes |
| icon_hash | TEXT | SHA-256 hash of content (used as local file path: ab/cd/ef/{hash}) |
| scan_state | TEXT DEFAULT 'unscanned' | unscanned, in_progress, completed, failed |
| error | TEXT | Error message if failed |
| downloaded_at | TIMESTAMPTZ | When the icon was fetched (NULL if not yet downloaded) |
Indexes:
- CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned': partial index for work claiming. It indexes only unscanned rows, so it shrinks as work completes. Write overhead is minimal because the index is updated only when a row transitions out of 'unscanned'.
- idx_icons_host_id on (host_id): for the best-icon selection query
Content-Addressed Storage: SHA-256 hash of the downloaded icon content, used as the local file path (ab/cd/ef/{full_hash}). This gives free dedup — if two sites serve the exact same favicon bytes, we store it once. Before writing, check if the file exists; if so, skip the write but still record the hash in the icons table.
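The storage rule above can be sketched in Go roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names digestOf, shardPath, and storeIcon are hypothetical, and the write is deliberately simple (no temp-file-and-rename step).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// digestOf returns the lowercase hex SHA-256 of the icon bytes.
func digestOf(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// shardPath maps a digest to {root}/ab/cd/ef/{digest}.
func shardPath(root, digest string) string {
	return filepath.Join(root, digest[0:2], digest[2:4], digest[4:6], digest)
}

// storeIcon writes content under its hash unless a file with that hash
// already exists. It reports the digest and whether this was a dedup hit.
func storeIcon(root string, content []byte) (string, bool, error) {
	digest := digestOf(content)
	path := shardPath(root, digest)
	_, err := os.Stat(path)
	if err == nil {
		return digest, true, nil // already on disk; caller still records the hash
	}
	if !errors.Is(err, fs.ErrNotExist) {
		return "", false, err
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return "", false, err
	}
	// Not atomic: a downloader with many workers would likely write to a
	// temp file and rename it into place instead.
	return digest, false, os.WriteFile(path, content, 0o644)
}

func main() {
	root, err := os.MkdirTemp("", "icons")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	digest, hit1, _ := storeIcon(root, []byte("fake icon bytes"))
	_, hit2, _ := storeIcon(root, []byte("fake icon bytes"))
	fmt.Println(shardPath("icons", digest), hit1, hit2)
}
```

The second storeIcon call finds the file from the first call and reports a dedup hit, which is the case where the hash is still recorded in the icons table but nothing is written.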
Bundle JSON format (tabs/{n}.json)
{
"entries": [
{
"url": "https://example.com",
"title": "Example Domain",
"icon": "iVBORw0KGgo...",
"icon_w": 32,
"icon_h": 32,
"iframe_ok": true
},
{
"url": "http://no-favicon-site.org",
"title": "A Site Without Favicon",
"icon": "",
"iframe_ok": false
}
]
}
Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with "icon": "") as long as they have a title. CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data.
Bundle size is parameterized (ENTRIES_PER_BUNDLE, default 120). Tuned to fill a viewport plus scroll buffer. Average bundle size ~215KB uncompressed, significantly smaller after Brotli.
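One way to model this format in Go is sketched below. The type name BundleEntry appears in the bundle generation stage; the field names, the Bundle wrapper, and the use of omitempty to drop icon_w/icon_h for icon-less hosts are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BundleEntry mirrors one element of "entries" in tabs/{n}.json.
type BundleEntry struct {
	URL      string `json:"url"`
	Title    string `json:"title"`
	Icon     string `json:"icon"`             // base64 PNG, "" when the host has no icon
	IconW    int    `json:"icon_w,omitempty"` // omitted when zero (no icon)
	IconH    int    `json:"icon_h,omitempty"`
	IframeOK bool   `json:"iframe_ok"`
}

// Bundle is the top-level JSON object.
type Bundle struct {
	Entries []BundleEntry `json:"entries"`
}

// encodeBundle serializes a slice of entries as one bundle document.
func encodeBundle(entries []BundleEntry) ([]byte, error) {
	return json.Marshal(Bundle{Entries: entries})
}

func main() {
	out, err := encodeBundle([]BundleEntry{
		{URL: "https://example.com", Title: "Example Domain", Icon: "iVBORw0KGgo...", IconW: 32, IconH: 32, IframeOK: true},
		{URL: "http://no-favicon-site.org", Title: "A Site Without Favicon"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```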
Pipeline Stages
The pipeline is a series of manually-run scripts executed in order on the single EC2 instance. Each stage is idempotent and resumable.
Stage 1: CC-Index Query
Tool: DuckDB with aws extension (credential chain) to read parquet directly from S3
Input: Common Crawl columnar index (parquet files on s3://commoncrawl/cc-index/...)
Query logic:
WHERE url_path = '/'
AND content_mime_type = 'text/html'
AND fetch_status = 200
AND url_query IS NULL
AND url_protocol IN ('http', 'https')
AND url_port IS NULL
Deduplication: Per hostname, prefer https over http. Result is one row per unique hostname.
Output: Populates hosts table in RDS (~30M rows for a full crawl).
Cost: $0 — Common Crawl is part of the AWS Open Data Registry. S3 GET requests and data transfer within us-east-1 are free.
Stats emitted: Total domains found, https vs http breakdown, duplicates removed.
Stage 2: WARC Parsing
Tool: Custom Go program, highly concurrent
Input: hosts table rows where parsed = FALSE
Process:
- Read batches of unparsed rows (cursor-based pagination by ID)
- For each row, make a byte-range S3 GetObject request to the commoncrawl bucket: Range: bytes={offset}-{offset+length-1}
  - Uses AWS SDK (not the data.commoncrawl.org HTTPS endpoint, which rate-limits at ~100 concurrent connections)
- Parse the WARC record to extract the HTTP response
- From HTTP response headers: check for X-Frame-Options and Content-Security-Policy frame-ancestors
- Parse HTML defensively (lenient parser, handle malformed HTML):
  - Extract <title> tag content
  - Extract ALL <link rel="icon"> / <link rel="shortcut icon"> entries with their href, type, and sizes attributes
- Insert a /favicon.ico entry into icons for every host (protocol://hostname/favicon.ico)
- Insert all discovered link rel="icon" entries into icons (any format: ICO, PNG, GIF, SVG, WebP, JPEG)
- Update hosts row: html_title, iframe_allowed, parsed = TRUE
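Two of the steps above are small enough to sketch directly: building the byte-range value and deriving iframe_allowed from the response headers. The source does not spell out the exact framing rule, so the one below is a conservative assumption (any X-Frame-Options value, or any frame-ancestors directive other than a bare `*`, counts as "not allowed"). The function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// warcRange builds the inclusive HTTP byte range for one WARC record.
func warcRange(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// iframeAllowed takes the X-Frame-Options value ("" if absent) and all
// Content-Security-Policy header values from the archived response.
func iframeAllowed(xFrameOptions string, cspHeaders []string) bool {
	if strings.TrimSpace(xFrameOptions) != "" {
		return false // DENY, SAMEORIGIN, and ALLOW-FROM all block third-party framing
	}
	for _, policy := range cspHeaders {
		for _, directive := range strings.Split(policy, ";") {
			fields := strings.Fields(strings.ToLower(directive))
			if len(fields) == 0 || fields[0] != "frame-ancestors" {
				continue
			}
			if len(fields) == 2 && fields[1] == "*" {
				continue // explicit wildcard: framing permitted
			}
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(warcRange(1000, 500))
	fmt.Println(iframeAllowed("", []string{"default-src 'self'; frame-ancestors 'none'"}))
	fmt.Println(iframeAllowed("", nil))
}
```

Erring toward "not allowed" costs little here, since those tabs simply open in a new browser tab instead of the inline viewer.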
Architecture: Three-stage pipeline:
[DB fetcher] → hostCh → [500 workers] → resultCh → [DB writer with pgx.Batch]
- DB fetcher (1 goroutine): continuously pages through unparsed hosts (batch size 5000), feeds hostCh.
- Workers (500 goroutines, configurable): fetch WARC from S3, parse HTML, update stats, send successful results to resultCh. I/O-bound on S3 latency.
- DB writer (1 goroutine): collects results, flushes every 100 using pgx.Batch (~400 queries per DB round-trip).
S3 retry with 6 attempts and exponential backoff for transient 503s.
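The shape of this fetcher, worker pool, and batching writer can be sketched with plain channels. Everything below is a stand-in: hosts are integers, parse replaces the S3 fetch and HTML parsing, and "flushing" just collects batches in memory rather than issuing a pgx.Batch.

```go
package main

import (
	"fmt"
	"sync"
)

// runPipeline wires fetcher -> workers -> batching writer and returns the
// batches the writer would have flushed to the database.
func runPipeline(hosts []int, workers, batchSize int, parse func(int) (string, bool)) [][]string {
	hostCh := make(chan int, 16)
	resultCh := make(chan string, 16)

	// Fetcher: feeds hostCh, then closes it to signal the end of input.
	go func() {
		for _, h := range hosts {
			hostCh <- h
		}
		close(hostCh)
	}()

	// Workers: only successful results are forwarded to the writer.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for h := range hostCh {
				if r, ok := parse(h); ok {
					resultCh <- r
				}
			}
		}()
	}
	go func() { wg.Wait(); close(resultCh) }()

	// Writer: flush every batchSize results, plus a final partial batch.
	var flushed [][]string
	batch := make([]string, 0, batchSize)
	for r := range resultCh {
		batch = append(batch, r)
		if len(batch) == batchSize {
			flushed = append(flushed, batch)
			batch = make([]string, 0, batchSize)
		}
	}
	if len(batch) > 0 {
		flushed = append(flushed, batch)
	}
	return flushed
}

func main() {
	hosts := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	batches := runPipeline(hosts, 4, 3, func(h int) (string, bool) {
		return fmt.Sprintf("host-%d", h), h%5 != 0 // pretend hosts 5 and 10 fail
	})
	fmt.Println(len(batches), "batches")
}
```

Because failed hosts never reach the writer, they keep parsed = FALSE and are picked up again on the next run.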
Error handling: Malformed HTML → still extract what we can (partial title, partial icons). WARC fetch failure → log and skip (host stays parsed = FALSE, retryable on next run). Max 50 link_rel icons per host (defensive cap against adversarial pages).
Icon URL handling: Relative URLs resolved against {protocol}://{hostname}/. Absolute URLs kept as-is. Data URIs ignored.
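A minimal sketch of this resolution rule using Go's net/url package. The function name resolveIconURL and the choice to return an empty string for ignored hrefs are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveIconURL resolves href against {protocol}://{hostname}/.
// It returns "" for empty hrefs, data URIs, and hrefs that fail to parse.
func resolveIconURL(protocol, hostname, href string) string {
	href = strings.TrimSpace(href)
	if href == "" || strings.HasPrefix(strings.ToLower(href), "data:") {
		return ""
	}
	ref, err := url.Parse(href)
	if err != nil {
		return ""
	}
	base := &url.URL{Scheme: protocol, Host: hostname, Path: "/"}
	// ResolveReference leaves absolute URLs unchanged and resolves
	// relative and protocol-relative ones against the base.
	return base.ResolveReference(ref).String()
}

func main() {
	for _, href := range []string{
		"/static/icon.png",
		"img/favicon.ico",
		"//cdn.example.net/i.png",
		"https://cdn.example.net/i.png",
		"data:image/png;base64,AAAA",
	} {
		fmt.Printf("%q -> %q\n", href, resolveIconURL("https", "example.com", href))
	}
}
```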
Cost: $0 (same Open Data program).
Stats emitted: Rows processed, titles extracted, no-title count, icons found, iframe restrictions, fetch/parse errors, DB errors, panics.
Stage 3: Icon Download
Tool: Custom Go program, highly concurrent
Prerequisite: Unbound running as system resolver on the EC2 instance.
Input: ALL icons table rows where scan_state = 'unscanned' — no size filter. Every favicon_ico and link_rel icon is downloaded regardless of declared size. The full archive is kept on disk; filtering happens later at best-icon selection and bundle generation.
Process:
- Producer goroutine claims batches via FOR UPDATE SKIP LOCKED:
    UPDATE icons SET scan_state = 'in_progress'
    WHERE id IN (
      SELECT id FROM icons
      WHERE scan_state = 'unscanned'
      LIMIT 5000
      FOR UPDATE SKIP LOCKED
    )
    RETURNING id, url;
  Icons are fed into a buffered channel. N worker goroutines consume from the channel, so workers never starve between batch claims.
- For each icon URL:
  - Make HTTP(S) GET request (standard Go HTTP client; DNS transparently goes through Unbound)
  - Shared http.Transport for connection pooling and TLS session reuse
  - Enforce timeouts: 5s connect, 10s total
  - Enforce max download size: 512KB (generous for icons, but prevents abuse)
- On success:
  - Validate magic bytes (is this actually an image?)
  - Decode to get dimensions:
    - PNG/GIF/WebP/JPEG/BMP: read image headers for width/height
    - ICO: parse ICO header, find largest embedded size ≤64x64 at a standard dimension (16/32/48/64), store THAT size in width/height
    - SVG: store width=NULL, height=NULL (vector, no pixel size)
  - Compute SHA-256 of content
  - Write to local disk at {icons_dir}/ab/cd/ef/{sha256} (skip if file already exists, for dedup)
  - Update icons row: icon_hash (the SHA-256 hash), content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
- On failure: scan_state = 'failed', error = reason
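The size cap and the magic-byte check from the steps above might look roughly like this. It is a sketch under assumptions: readCapped and sniffImage are hypothetical names, and the sniffing relies on the standard library's http.DetectContentType, which recognizes PNG, GIF, JPEG, BMP, WebP, and ICO signatures but not SVG, so a real downloader would need a separate SVG check.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

const maxIconBytes = 512 * 1024

var errTooLarge = errors.New("too large")

// readCapped reads the body but gives up once more than limit bytes arrive,
// so an oversized response is never buffered in full.
func readCapped(r io.Reader, limit int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errTooLarge
	}
	return data, nil
}

// sniffImage returns the MIME type detected from the leading bytes and
// whether it is an image type. The HTTP Content-Type header is ignored.
func sniffImage(data []byte) (string, bool) {
	ct := http.DetectContentType(data)
	return ct, strings.HasPrefix(ct, "image/")
}

func main() {
	png := []byte("\x89PNG\r\n\x1a\n rest of file")
	data, err := readCapped(bytes.NewReader(png), maxIconBytes)
	ct, ok := sniffImage(data)
	fmt.Println(len(data), err, ct, ok)

	_, err = readCapped(bytes.NewReader(make([]byte, maxIconBytes+1)), maxIconBytes)
	fmt.Println(err)
}
```

Reading limit+1 bytes is what distinguishes "exactly at the cap" from "over the cap" without consuming the rest of the stream.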
Concurrency: Channel-based worker pool (default 2500 workers, configurable). Producer goroutine feeds a buffered channel (buffer = batch size), shuffles each batch to avoid hitting the same host back-to-back. N workers consume from the channel.
Fast failure strategy:
- DNS failure → fail immediately (Unbound will cache NXDOMAIN)
- Connection refused → fail immediately
- Timeout → fail after deadline (no retry)
- Too large → abort read at 512KB boundary
- Not an image → fail (record content-type in error)
Permissive on format: Download everything — ICO, PNG, GIF, SVG, WebP, JPEG, BMP, whatever the server returns. Store the raw bytes on disk. Format filtering and conversion happens later in bundle generation.
Scaling to fleet (if needed):
- Multiple EC2 instances run the same binary
- Each claims work via Postgres row-level locking (FOR UPDATE SKIP LOCKED)
- No coordinator needed: linear scaling with instance count
Stats emitted: Icons attempted, completed, failed (breakdown by error type: DNS, timeout, connection refused, HTTP 4xx, HTTP 5xx, invalid image, too large), icons/sec rate, bytes downloaded, dedup hits.
Stage 4: Best Icon Selection
Tool: SQL script
Process: For each host, select the best icon from all its completed downloads.
Selection priority (decision flow):
Target: 32x32 source icon. The frontend displays favicons at 16x16 CSS pixels, which is 32x32 physical pixels on 2x Retina screens. So 32x32 is the ideal source resolution — crisp on Retina without wasting bundle space.
- Icons ≥32px (preferred): smallest first, so closest to 32 wins. A 32x32 beats a 48x48 beats a 180x180.
- Icons <32px (fallback): largest first. A 16x16 beats an 8x8.
- Unknown dimensions (NULL width/height): last resort.
Within the same size tier:
- Prefer PNG > ICO > GIF/JPEG/BMP > WebP
- Tiebreaker: smaller file size
SVGs excluded (can't rasterize without external deps). Icons ≤2x2 excluded (tracking pixels).
Does not distinguish between favicon_ico and link_rel sources — purely based on what was actually downloaded and its dimensions/format.
Uses DISTINCT ON (host_id) for efficient single-pass selection. See pipeline/04_best_icon/select.sql.
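The actual selection is done in SQL, but the ordering is easier to check when restated as a comparator. The Go sketch below is only an illustrative restatement of the rules above. The Icon type, the MIME strings, the use of 0 for NULL dimensions, and comparing on width alone are all assumptions.

```go
package main

import "fmt"

// Icon holds the columns that matter for selection. Width or Height of 0
// stands in for NULL (unknown dimensions).
type Icon struct {
	ContentType   string
	Width, Height int
	FileSize      int
}

// sizeTier: 0 for >=32px (preferred), 1 for <32px (fallback), 2 for unknown.
func sizeTier(ic Icon) int {
	switch {
	case ic.Width == 0 || ic.Height == 0:
		return 2
	case ic.Width >= 32:
		return 0
	default:
		return 1
	}
}

// formatRank encodes PNG > ICO > GIF/JPEG/BMP > WebP (lower is better).
func formatRank(ct string) int {
	switch ct {
	case "image/png":
		return 0
	case "image/x-icon", "image/vnd.microsoft.icon":
		return 1
	case "image/gif", "image/jpeg", "image/bmp":
		return 2
	case "image/webp":
		return 3
	}
	return 4
}

// better reports whether a should be chosen over b.
func better(a, b Icon) bool {
	ta, tb := sizeTier(a), sizeTier(b)
	if ta != tb {
		return ta < tb
	}
	if a.Width != b.Width && ta != 2 {
		if ta == 0 {
			return a.Width < b.Width // at or above 32px: smallest wins
		}
		return a.Width > b.Width // below 32px: largest wins
	}
	if fa, fb := formatRank(a.ContentType), formatRank(b.ContentType); fa != fb {
		return fa < fb
	}
	return a.FileSize < b.FileSize
}

// pickBest skips SVGs and icons of 2x2 or smaller, then applies better.
func pickBest(icons []Icon) (Icon, bool) {
	var best Icon
	found := false
	for _, ic := range icons {
		if ic.ContentType == "image/svg+xml" {
			continue
		}
		if ic.Width != 0 && ic.Height != 0 && ic.Width <= 2 && ic.Height <= 2 {
			continue
		}
		if !found || better(ic, best) {
			best, found = ic, true
		}
	}
	return best, found
}

func main() {
	best, ok := pickBest([]Icon{
		{ContentType: "image/png", Width: 180, Height: 180, FileSize: 9000},
		{ContentType: "image/x-icon", Width: 48, Height: 48, FileSize: 4000},
		{ContentType: "image/png", Width: 32, Height: 32, FileSize: 1200},
		{ContentType: "image/png", Width: 16, Height: 16, FileSize: 500},
	})
	fmt.Println(best.Width, ok)
}
```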
Stats emitted: Hosts with icons selected, hosts without any icon.
Stage 5: Bundle Generation
Tool: Custom Go program (multi-threaded for image processing)
Input: All hosts where html_title IS NOT NULL (include hosts without icons)
Architecture: Four-stage pipeline with all stages running concurrently:
[DB fetcher] → hostCh → [N converters] → entryCh → [bundle assembler] → uploadCh → [M uploaders]
- DB fetcher (1 goroutine): continuously fetches pages of hosts via keyset pagination on random_order. Feeds hosts into hostCh. Never waits for downstream stages.
- Converter workers (N goroutines, default 20): read hosts from hostCh, read icon from disk, decode, re-encode as PNG, base64-encode, emit BundleEntry to entryCh. CPU-bound; default tuned to ~5x core count on c5.xlarge (4 vCPUs).
  - Decode via Go's image.Decode (handles PNG, GIF, JPEG, WebP, BMP, ICO via registered decoders)
  - SVGs excluded (no rasterizer); these hosts appear without icons
  - Icons >128px downscaled to 32x32 (nearest-neighbor). Icons ≤128px kept as-is.
- Bundle assembler (1 goroutine): collects entries from entryCh. Every 120 entries (configurable), serializes as JSON and sends to uploadCh. Hosts without icons included with "icon": "".
- Upload workers (M goroutines, default 10): write bundles to S3 (or local disk in dry-run mode). I/O-bound; multiple uploads in flight hide S3 PUT latency (~50-100ms each).
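The converter's resize and encode steps could be sketched with the standard library alone, as below. The helper names (downscaleNearest, normalize, encodeIcon, checker) are hypothetical, the synthetic checkerboard replaces a decoded icon, and treating "larger than 128px" as "either dimension exceeds 128" is an assumption.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"image"
	"image/color"
	"image/png"
)

// downscaleNearest resamples src to w x h by copying, for each destination
// pixel, the source pixel at the proportionally scaled coordinate.
func downscaleNearest(src image.Image, w, h int) *image.RGBA {
	dst := image.NewRGBA(image.Rect(0, 0, w, h))
	sb := src.Bounds()
	for y := 0; y < h; y++ {
		sy := sb.Min.Y + y*sb.Dy()/h
		for x := 0; x < w; x++ {
			sx := sb.Min.X + x*sb.Dx()/w
			dst.Set(x, y, src.At(sx, sy))
		}
	}
	return dst
}

// normalize applies the rule from the list: >128px becomes 32x32,
// anything else is kept as-is.
func normalize(src image.Image) image.Image {
	b := src.Bounds()
	if b.Dx() > 128 || b.Dy() > 128 {
		return downscaleNearest(src, 32, 32)
	}
	return src
}

// encodeIcon re-encodes the image as PNG and returns it base64-encoded,
// ready for the "icon" field of a bundle entry.
func encodeIcon(img image.Image) (string, error) {
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// checker builds a red/blue checkerboard as a stand-in for a decoded icon.
func checker(w, h int) image.Image {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			c := color.RGBA{B: 255, A: 255}
			if (x/8+y/8)%2 == 0 {
				c = color.RGBA{R: 255, A: 255}
			}
			img.Set(x, y, c)
		}
	}
	return img
}

func main() {
	icon := normalize(checker(256, 256))
	encoded, err := encodeIcon(icon)
	if err != nil {
		panic(err)
	}
	fmt.Println(icon.Bounds().Dx(), icon.Bounds().Dy(), encoded[:11])
}
```

Every PNG starts with the same 8-byte signature, which is why the base64 strings in the bundles all begin with iVBORw0KGgo.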
Bundles are written in-place (overwriting previous run). No delete-first step, so the live site always has valid data even if bundle gen crashes midway. The frontend's TOTAL_BUNDLES constant ensures only valid bundle indices are requested.
Output:
- tabs/0000.json through tabs/{M}.json in S3 everytab-site
- Total bundle count M (bake into frontend via deploy script)
Stats emitted: Total bundles created, total hosts included (with icon / without icon), average bundle size (bytes), total S3 storage used, icon conversion failures.
Stage 6: Frontend Deploy
Tool: pipeline/06_frontend/deploy.sh
Process:
- sed injects const TOTAL_BUNDLES = {M}; into a temp copy of index.html
- Uploads index.html, site.js, bot.html, about.html to S3 everytab-site
- Invalidates CloudFront cache for all four files (auto-detects distribution ID)
Stage 7: Backup & Teardown
Process (manual, with confirmation at each step):
- Dump RDS database: pg_dump -Fc → transfer to homelab via rsync
- Sync icons from local disk: rsync -avP ~/icons/ homelab:/backups/everytab/icons/
- Verify backups: confirm pg_dump restores cleanly on homelab, spot-check icon files
- Tear down scanning infra: terraform apply -var="scanning=false" (deletes RDS, EC2, icons S3 bucket)
Performance Characteristics
Each pipeline stage has different bottlenecks. Understanding these explains the concurrency choices and why certain stages can't be sped up further on a single machine.
Stage 1: CC-Index Query
- Download phase: network-bound. aws s3 sync of ~166GB of parquet files. Throughput limited by EC2 network bandwidth (up to 10 Gbps on c5.2xlarge). Takes ~10-15 minutes.
- Query phase: memory-bound. DuckDB loads the GROUP BY hash table into memory. At 30M output rows, the hash table approaches 16GB. temp_directory is set to EBS so DuckDB spills to NVMe efficiently (large sequential I/O) rather than relying on OS swap (random 4KB page faults). On c5.2xlarge (16GB RAM) with 8GB swap, the query completes without severe thrashing.
- Not CPU-bound: DuckDB's columnar scan is efficient, and CPU cores are underutilized during the query.
Stage 2: WARC Parsing
- CPU-bound + network I/O-bound (S3). Each WARC fetch is a byte-range S3 GetObject request (~100-200ms round-trip), but TLS handshakes + gzip decompression + HTML parsing consume significant CPU. At 500 goroutines on 4 cores, CPU was at 100%. On c5.2xlarge (8 cores), more workers can actually compute simultaneously.
- DB writes batched via pgx.Batch: 500 results (~2000 queries) per round-trip. Non-burstable RDS (db.m5.large) provides consistent write performance. Burstable t3 instances throttle under sustained load and cause pipeline stalls via channel back-pressure.
- Channel buffers sized to prevent stalls: hostCh (20K) gives the DB fetcher enough runway between queries. resultCh (1K) absorbs write latency spikes.
- S3 retry with 6 attempts and exponential backoff handles transient 503s from the commoncrawl bucket.
- Measured: 566 hosts/sec at concurrency 500 on c5.xlarge (4 cores). Expected ~1000+ hosts/sec on c5.2xlarge (8 cores).
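The retry policy mentioned above (6 attempts, exponential backoff) can be sketched as a small helper. The source gives only the attempt count, so the 100ms base delay, the plain doubling without jitter, and the names retry and errTransient are all assumptions. The sleep function is injected so the schedule can be checked without waiting.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a retryable S3 503 response.
var errTransient = errors.New("503 Slow Down")

// retry calls op up to attempts times. After the i-th failure (counting
// from zero) it sleeps baseMs * 2^i milliseconds before trying again.
func retry(attempts, baseMs int, sleepMs func(int), op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			sleepMs(baseMs << i)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	realSleep := func(ms int) { time.Sleep(time.Duration(ms) * time.Millisecond) }
	err := retry(6, 100, realSleep, func() error {
		calls++
		if calls < 3 {
			return errTransient // first two attempts fail
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

With these assumed numbers, a request that fails all six times waits 100 + 200 + 400 + 800 + 1600 ms in total before the worker logs the failure and moves on.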
Stage 3: Icon Download
- Network I/O-bound (internet). Downloading from millions of different web servers worldwide. Latency varies wildly (1ms to 10s). The long tail of slow/dead servers dominates — most icons download in <500ms but timeouts (10s) hold workers.
- The long pole of the pipeline — longest stage at 30M scale.
- 5000 concurrent goroutines to keep throughput high despite variable latency. Not CPU-bound (magic byte checks and SHA-256 are fast). Not DB-bound (one write per icon at ~1ms, self-smoothing due to random server latencies).
- Memory is the concurrency limit — each goroutine holds a TCP connection + TLS session + icon data buffer. At 5000 workers on c5.2xlarge (16GB), ~2-3GB for connection overhead — comfortable.
- Disk I/O is negligible — icons are small (median ~5KB), writes are sharded across directories.
- DNS is cached — Unbound's aggressive caching (1.7GB cache, 3600s min-TTL) means repeat TLD/nameserver lookups are instant. First-seen domains incur recursive resolution (~50-100ms) but this is pipelined with the HTTP request.
- Measured: 2,136 icons/sec at concurrency 5000 on c5.2xlarge (up from 439/sec at 1000 concurrency on c5.xlarge). CPU utilization was ~90% at this concurrency.
Stage 4: Best Icon Selection
- CPU-bound (Postgres). Single SQL query with DISTINCT ON and multi-column sort. Runs in seconds even at 30M; Postgres handles this efficiently with the idx_icons_host_id index.
Stage 5: Bundle Generation
- CPU-bound (image conversion). Decoding icons (especially ICO) and re-encoding as PNG is the bottleneck. 40 converter goroutines on c5.2xlarge (8 cores) keep all cores saturated. More goroutines don't help — they just compete for cores.
- Disk I/O is secondary — reading small icon files from the sharded directory. Usually cached in the OS page cache after first access.
- S3 uploads are pipelined — 10 upload workers hide the ~50-100ms PUT latency. The assembler serializes bundles while previous uploads are in flight.
- DB reads are pipelined — the fetcher goroutine prefetches pages while converters work, so workers never wait for DB.
- Measured: 2,377 hosts/sec at concurrency 20 on c5.xlarge (4 cores). Expected ~4500+ hosts/sec at concurrency 40 on c5.2xlarge.
Stage 6: Frontend Deploy
- Network-bound. 4 small file uploads to S3 + CloudFront invalidation. Seconds.
Summary: what would make each stage faster
| Stage | Current bottleneck | To speed up further |
|---|---|---|
| CC-Index | Memory (DuckDB hash table spill) | Streaming dedup via INSERT ON CONFLICT, or more RAM |
| WARC parsing | CPU + S3 latency | More cores, or multiple EC2 instances |
| Icon download | Internet latency (slow/dead servers) | Multiple EC2 instances |
| Bundle gen | CPU (image decode/encode) | More cores, or better image libraries |
| Deploy | N/A | Already seconds |
DNS Architecture
Unbound runs on the EC2 instance as the system DNS resolver.
Configuration:
- Recursive resolver mode (no forwarding to any upstream — resolves from root servers)
- Listening on 127.0.0.1:53
- Set as system resolver in /etc/resolv.conf
- Aggressive caching enabled
- High min-TTL (3600s) — maximizes cache hits for TLD/popular nameservers
- High cache size (allocate 1-2GB RAM to Unbound)
- Prefetch enabled (refresh popular entries before expiry)
Why recursive instead of forwarding: Forwarding to Google/Cloudflare would get us rate-limited at 30M+ lookups. Recursive resolution distributes load across thousands of authoritative nameservers. With caching, the actual external query volume is much lower than 30M (most domains share TLD nameservers, many share CDN nameservers).
Transparent to Go: The Go HTTP client uses the OS resolver, which uses Unbound. No custom transport, no SNI issues, no pre-resolved IPs needed. Standard HTTPS connections with normal hostname verification.
Frontend Architecture
File Structure
- index.html: minimal HTML shell, inline CSS
- site.js: tab rendering logic, bundle fetching, interaction (separate file for cleanliness, cached after first load)
Requests Per Visit
- GET /index.html: HTML + CSS (<10KB)
- GET /site.js: JavaScript (cached indefinitely via content hash in filename or cache headers)
- GET /tabs/{random}.json: first bundle (~150-300KB, Brotli-compressed to ~100-200KB)
Subsequent scrolls: one additional /tabs/{n}.json per "page" of tabs.
Tab Rendering
- Rows of tabs fill the viewport, styled to match the visitor's browser (Chrome, Firefox, Safari; detected via navigator.userAgent)
- Each row has a bidirectional marquee animation at varying speeds (90-150s per cycle), with random stagger to avoid synchronization
- Tabs duplicated in DOM for seamless marquee loop (translateX(-50%))
- Each tab shows: favicon (rendered via <img src="data:image/png;base64,...">) + truncated title
- No-icon tabs: just title text, no icon
- Light mode default, auto-switches to dark mode via prefers-color-scheme
- Hover shows full title as native tooltip
Interaction
- Click tab (iframe_ok=true): Opens an inline iframe viewer between tab rows (75vh height, pushes content down)
- Click tab (iframe_ok=false): Opens site in a new tab (with ↗ external-link indicator on the tab)
- Close viewer: X button or Escape key. Only one viewer open at a time.
- Scroll down: When approaching the bottom, fetch next random bundle and render more rows
Randomization
- Seed: Date.now() (milliseconds UTC), so every visitor at a different moment sees different tabs
- PRNG: seeded random number generator (e.g., mulberry32 or xoshiro) for deterministic sequence from seed
- Generate random bundle indices in range [0, TOTAL_BUNDLES)
- Track fetched bundle IDs in a Set to avoid loading duplicates on continued scroll
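The selection logic above is small enough to sketch. The frontend would implement it in JavaScript in site.js; the version below is a Go rendition of the publicly known mulberry32 generator, written in Go only to stay consistent with the pipeline code. The name pickBundles is hypothetical, and truncating the millisecond timestamp to 32 bits for the seed is an assumption.

```go
package main

import "fmt"

// mulberry32 is a small 32-bit seeded PRNG. uint32 arithmetic wraps on
// overflow, matching the 32-bit integer math of the JavaScript original.
type mulberry32 struct{ state uint32 }

// next returns a float in [0, 1).
func (m *mulberry32) next() float64 {
	m.state += 0x6D2B79F5
	t := m.state
	t = (t ^ (t >> 15)) * (t | 1)
	t ^= t + (t^(t>>7))*(t|61)
	return float64(t^(t>>14)) / 4294967296.0
}

// pickBundles returns n distinct bundle indices in [0, total), in the
// order the PRNG produced them. The seen map plays the role of the Set.
func pickBundles(seed uint32, total, n int) []int {
	if n > total {
		n = total
	}
	rng := &mulberry32{state: seed}
	seen := make(map[int]bool, n)
	out := make([]int, 0, n)
	for len(out) < n {
		idx := int(rng.next() * float64(total))
		if !seen[idx] {
			seen[idx] = true
			out = append(out, idx)
		}
	}
	return out
}

func main() {
	// A millisecond timestamp truncated to 32 bits would serve as the seed.
	fmt.Println(pickBundles(1234567, 250000, 5))
}
```

Because the sequence depends only on the seed, the same seed always yields the same bundle order, which makes a given page arrangement reproducible for debugging.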
Future Enhancements
- Mobile-optimized layout
- "Search for a site" feature
- Stats page (how many sites, coverage, etc.)
- Performance: IntersectionObserver to pause off-screen marquee rows
Statistics & Metadata
Each pipeline stage emits a JSON stats file:
stats/
01_cc_index.json
02_warc_parse.json
03_icon_download.json
04_best_icon.json
05_bundle_gen.json
After bundle generation, these are merged into a single stats.json uploaded to everytab-site:
{
"crawl_id": "CC-MAIN-2026-05",
"generated_at": "2026-05-17T12:00:00Z",
"pipeline": {
"cc_index": {
"started_at": "2026-05-17T08:00:00Z",
"finished_at": "2026-05-17T08:42:00Z",
"duration_seconds": 2520,
"total_domains": 31245678,
"https": 28901234,
"http_only": 2344444,
"duplicates_removed": 1456789
},
"warc_parse": {
"started_at": "2026-05-17T08:45:00Z",
"finished_at": "2026-05-17T12:15:00Z",
"duration_seconds": 12600,
"processed": 31245678,
"titles_extracted": 29876543,
"icons_found": 45678901,
"iframe_restricted": 12345678,
"parse_failures": 234567
},
"icon_download": {
"started_at": "2026-05-17T12:20:00Z",
"finished_at": "2026-05-18T18:30:00Z",
"duration_seconds": 108600,
"attempted": 45678901,
"completed": 38901234,
"failed_dns": 2345678,
"failed_timeout": 1234567,
"failed_http_error": 1567890,
"failed_invalid_image": 890123,
"failed_too_large": 12345,
"unique_icons_stored": 34567890,
"dedup_hits": 4333344
},
"best_icon": {
"started_at": "2026-05-18T18:35:00Z",
"finished_at": "2026-05-18T18:40:00Z",
"duration_seconds": 300,
"hosts_with_icon": 27654321,
"hosts_without_icon": 3591357
},
"bundles": {
"started_at": "2026-05-18T18:45:00Z",
"finished_at": "2026-05-18T20:10:00Z",
"duration_seconds": 5100,
"total_bundles": 52341,
"total_hosts_included": 29876543,
"hosts_with_icon": 27654321,
"hosts_without_icon": 2222222,
"excluded_no_title": 1369135,
"avg_bundle_size_bytes": 245000
}
}
}
This is served publicly at /stats.json on the live site — interesting metadata for visitors and useful for monitoring pipeline health across crawls.
Cost Estimate
Scanning Phase (One-Time per Crawl)
| Item | Estimate |
|---|---|
| EC2 c5.2xlarge (~2-3 days) | $16-24 |
| EBS 1TB gp3 (~3 days) | $8 |
| RDS db.m5.large (~3 days) | $12-15 |
| Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) |
| Data transfer (icon downloads from internet, inbound) | $0 (inbound free) |
| Data transfer (backup to homelab, outbound) | $5-59 (depends on icon archive size) |
| Total | ~$41-106 |
Hosting Phase (Monthly Steady-State)
| Item | Estimate |
|---|---|
| S3 everytab-site storage (~10-15GB of bundles) | $0.35 |
| CloudFront (free tier: 1TB/month transfer, 10M requests/month) | $0 |
| S3 origin requests via CloudFront (heavily cached) | $1-3 |
| Total | ~$2-4/month |
Note: Bundle storage estimate revised down. With ~50K bundles at ~250KB each = ~12.5GB, well under previous estimate since we're targeting viewport-fill (100-150 tabs) not 1MB bundles.
If the site gets significant traffic beyond CloudFront free tier, costs scale with usage — but that's a success problem.
Scaling Strategy
Development Phase (100K domains)
- Cap CC-Index query to 100K rows
- Full pipeline runs in minutes
- Validates end-to-end correctness
- Frontend development and tab-density tuning
Full Scan (30M domains)
- Single EC2 instance, high concurrency
- CC-Index query: <1hr (httpfs) or ~2hrs (download + local query)
- WARC parsing: 2-6hrs
- Icon download: 12-48hrs (the long pole)
- Bundle generation: 1-2hrs
- Total: ~1-2 days
Fleet Scaling (if single instance is too slow)
- Spin up N identical EC2 instances running the icon downloader
- All connect to the same RDS instance
- Work claiming via FOR UPDATE SKIP LOCKED: no double work, no coordinator
- Linear throughput scaling: 4 instances ≈ 4x download speed
- Only the icon download stage benefits from fleet (other stages are fast enough solo)
Key Design Decisions
- Static-only hosting — No servers for the live site. Everything pre-built. Minimal attack surface, minimal cost.
- Inline icons in bundles — One fetch gives you 100+ tabs to render. No per-icon requests.
- Base64 + Brotli: Base64 for browser-native decoding (atob()). Brotli compression at the CDN layer reduces transfer size by ~25-30% for free.
- Unbound as system resolver: Transparent to application code. Standard Go HTTP. No custom networking.
- SHA-256 content-addressed icon storage — Natural dedup on local disk. Same favicon stored once even if referenced by multiple hosts.
- Permissive download, selective bundling — Download ALL favicon formats and sizes during scanning. Convert to optimized PNG only during bundle generation. Decouples "capture as much as possible" from "serve the best version."
- Partial index for work claiming — Indexes only unscanned rows. Shrinks as work progresses. Minimal write amplification.
- Local disk for icons, S3 for site — Icons stored on EBS during scanning (avoids ~$175 in S3 PUT costs at 30M scale). Only the static site lives in S3 behind CloudFront.
- Per-millisecond random seed — Every visitor sees a unique arrangement. No shared state, no server needed for randomization.
- Viewport-sized bundles — ~100-150 tabs per bundle, tuned to fill a screen. Faster loads, smaller memory footprint than 1MB bundles.
- Include no-icon hosts — Any host with a title is included. Firefox-style rendering (title only) for hosts without favicons.
- Denormalized best_icon_hash in hosts — Stores the SHA-256 hash of the chosen icon. Avoids joins during bundle generation. Written once during icon selection, read once during bundling.