updated ARCHITECTURE.md

Joe Lothan 2026-05-19 23:46:06 -04:00
parent 2f1547a912
commit 258c6c5f3a


@@ -19,11 +19,14 @@ flowchart TD
C["Stage 3: Download Icons - Go"] C["Stage 3: Download Icons - Go"]
D["Stage 4: Select Best Icons"] D["Stage 4: Select Best Icons"]
E["Stage 5: Generate Bundles - Go"] E["Stage 5: Generate Bundles - Go"]
F["Stage 6: Build Frontend"] F["Stage 6: Deploy Frontend"]
UB["Unbound - Local recursive resolver"] UB["Unbound - Local recursive resolver"]
DISK["Local disk - Sharded icon archive"]
A --> B --> C --> D --> E --> F A --> B --> C --> D --> E --> F
UB -.-> C UB -.-> C
C --> DISK
DISK --> E
end end
subgraph ExtData["External Data"] subgraph ExtData["External Data"]
@@ -32,22 +35,19 @@ flowchart TD
subgraph AWS["AWS Services"] subgraph AWS["AWS Services"]
RDS[("RDS Postgres - hosts + icons tables")] RDS[("RDS Postgres - hosts + icons tables")]
S3I["S3: everytab-icons - Raw downloaded favicons"]
S3S["S3: everytab-site - tabs/*.json + index.html"] S3S["S3: everytab-site - tabs/*.json + index.html"]
CF["CloudFront CDN"] CF["CloudFront CDN"]
end end
subgraph Post["Post-Scan"] subgraph Post["Post-Scan"]
BAK["Backup to Homelab - RDS dump + icons sync"] BAK["Backup to Homelab - RDS dump + icons rsync"]
TEAR["Teardown - Delete RDS, icons bucket, EC2"] TEAR["Teardown - Delete RDS, EC2"]
end end
CC --> A CC --> A
CC --> B CC --> B
A --> RDS A --> RDS
B --> RDS B --> RDS
B --> S3I
C --> S3I
C --> RDS C --> RDS
D --> RDS D --> RDS
E --> S3S E --> S3S
@@ -66,23 +66,19 @@ All resources in **us-east-1**.
| Resource | Purpose | Lifecycle | | Resource | Purpose | Lifecycle |
|----------|---------|-----------| |----------|---------|-----------|
| EC2 (c5.xlarge) | Run all pipeline stages | Scanning only | | EC2 (c5.xlarge) + 1TB EBS | Run all pipeline stages, store icon archive | Scanning only |
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup to homelab, then delete) | | RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup to homelab, then delete) |
| S3 `everytab-icons` | Raw downloaded favicons | Scanning only (backup to homelab, then delete) |
| S3 `everytab-site` | Static site: index.html, site.js, tabs/*.json | Permanent | | S3 `everytab-site` | Static site: index.html, site.js, tabs/*.json | Permanent |
| CloudFront | CDN for static site (Brotli compression enabled) | Permanent | | CloudFront | CDN for static site (Brotli compression enabled) | Permanent |
| S3 `everytab-logs` | CloudFront access logs | Permanent |
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only (runs on EC2) | | Unbound (on EC2) | Local recursive DNS resolver | Scanning only (runs on EC2) |
### Why Two S3 Buckets ### Icon Storage
- `everytab-site` is configured as a CloudFront origin with public read access (via OAC). The entire bucket IS the website. Icons are stored on local disk during scanning, not S3. The EBS volume holds the full icon archive in a sharded directory structure (`ab/cd/ef/{sha256}`). This avoids ~$175 in S3 PUT costs at 30M scale. After scanning completes, icons are backed up to the homelab via rsync.
- `everytab-icons` is completely private — only the EC2 instance reads/writes to it. No public access configuration needed.
- Backup is clean: `aws s3 sync s3://everytab-icons/ /homelab/path/` grabs the whole bucket.
- Deletion is clean: `aws s3 rb s3://everytab-icons --force` — zero risk of nuking the live site.
- One bucket with prefix-based policies works but is fiddlier (CloudFront must serve `tabs/` and `index.html` but NOT `icons/`). Two buckets eliminates that surface area for misconfiguration.
### Steady-State (Hosting Only) ### Steady-State (Hosting Only)
- S3 `everytab-site` — index.html + site.js + ~50K JSON bundles - S3 `everytab-site` — index.html + site.js + ~250K JSON bundles
- CloudFront distribution — Brotli-compressed delivery, caching - CloudFront distribution — Brotli-compressed delivery, caching
## Data Model ## Data Model
@@ -100,8 +96,9 @@ All resources in **us-east-1**.
| warc_record_length | INT NOT NULL | Length of WARC record | | warc_record_length | INT NOT NULL | Length of WARC record |
| html_title | TEXT | Extracted from `<title>` tag | | html_title | TEXT | Extracted from `<title>` tag |
| iframe_allowed | BOOLEAN | True if site allows framing | | iframe_allowed | BOOLEAN | True if site allows framing |
| best_icon_s3_key | TEXT | S3 key of the chosen icon (denormalized for fast bundle gen) | | best_icon_s3_key | TEXT | SHA-256 hash of the chosen icon file (denormalized for fast bundle gen) |
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed | | parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |
| random_order | DOUBLE PRECISION DEFAULT random() | Random value for shuffled bundle generation pagination |
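The `random_order` column supports keyset pagination: each page resumes strictly after the last row of the previous one, so no `OFFSET` scan is needed. The sketch below is illustrative only. The query text, the `id` tiebreaker for equal `random_order` values, and the names `host`, `nextPage`, and `walk` are assumptions, and an in-memory slice stands in for the table.

```go
package main

import (
	"fmt"
	"sort"
)

// host is a hypothetical, trimmed-down view of a hosts row.
type host struct {
	ID          int64
	RandomOrder float64
}

// pageQuery shows the SQL shape that nextPage mimics in memory (assumed, not
// taken from the pipeline source). The id column breaks ties between rows
// that happen to share a random_order value.
const pageQuery = `
SELECT id, random_order FROM hosts
WHERE html_title IS NOT NULL AND (random_order, id) > ($1, $2)
ORDER BY random_order, id
LIMIT $3`

// nextPage returns up to limit hosts strictly after the (lastOrder, lastID)
// cursor. hosts must already be sorted by (RandomOrder, ID).
func nextPage(hosts []host, lastOrder float64, lastID int64, limit int) []host {
	start := sort.Search(len(hosts), func(i int) bool {
		h := hosts[i]
		return h.RandomOrder > lastOrder || (h.RandomOrder == lastOrder && h.ID > lastID)
	})
	end := start + limit
	if end > len(hosts) {
		end = len(hosts)
	}
	return hosts[start:end]
}

// walk pages through every host and returns the IDs in visiting order.
func walk(hosts []host, limit int) []int64 {
	var ids []int64
	lastOrder, lastID := -1.0, int64(0) // random() is never negative
	for {
		page := nextPage(hosts, lastOrder, lastID, limit)
		if len(page) == 0 {
			return ids
		}
		for _, h := range page {
			ids = append(ids, h.ID)
		}
		last := page[len(page)-1]
		lastOrder, lastID = last.RandomOrder, last.ID
	}
}

func main() {
	hosts := []host{{3, 0.1}, {1, 0.5}, {2, 0.5}, {4, 0.9}}
	fmt.Println(pageQuery)
	fmt.Println(walk(hosts, 2))
}
```

Because the shuffle is fixed when the row is inserted, re-running bundle generation walks hosts in the same order, which keeps bundle contents stable between runs.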
### `icons` table ### `icons` table
@@ -117,7 +114,7 @@ All resources in **us-east-1**.
| width | INT | Best usable pixel width (for ICO: largest standard size ≤64; for SVG: NULL) | | width | INT | Best usable pixel width (for ICO: largest standard size ≤64; for SVG: NULL) |
| height | INT | Best usable pixel height (for ICO: largest standard size ≤64; for SVG: NULL) | | height | INT | Best usable pixel height (for ICO: largest standard size ≤64; for SVG: NULL) |
| file_size | INT | Size in bytes | | file_size | INT | Size in bytes |
| s3_key | TEXT | Key in everytab-icons bucket (SHA-256 of content) | | s3_key | TEXT | SHA-256 hash of content (used as local file path, legacy column name) |
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` | | scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
| error | TEXT | Error message if failed | | error | TEXT | Error message if failed |
@@ -125,7 +122,7 @@ All resources in **us-east-1**.
- `CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned'` — partial index for work claiming. Only indexes unscanned rows; shrinks as work completes. Minimal write overhead since index only updates on transition OUT of 'unscanned'. - `CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned'` — partial index for work claiming. Only indexes unscanned rows; shrinks as work completes. Minimal write overhead since index only updates on transition OUT of 'unscanned'.
- `idx_icons_host_id` on (host_id) — for best-icon selection query - `idx_icons_host_id` on (host_id) — for best-icon selection query
**S3 Key Strategy:** SHA-256 hash of the downloaded icon content. This gives free dedup at the storage layer — if two sites serve the exact same favicon bytes, we store it once. The hash is computed client-side (by the Go downloader) and used as the key. Before uploading, check if the key exists; if so, skip the upload but still record the s3_key in the icons table. **Content-Addressed Storage:** SHA-256 hash of the downloaded icon content, used as the local file path (`ab/cd/ef/{full_hash}`). This gives free dedup — if two sites serve the exact same favicon bytes, we store it once. Before writing, check if the file exists; if so, skip the write but still record the hash in the icons table.
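A minimal sketch of the content-addressed layout, assuming the shard directories are the first three byte pairs of the hex digest as the `ab/cd/ef/{full_hash}` pattern suggests. The function names (`hashHex`, `shardPath`, `storeIcon`) and the temp-file-then-rename step are illustrative choices, not taken from the pipeline source.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// hashHex returns the lowercase hex SHA-256 digest of content.
func hashHex(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// shardPath maps a digest to {root}/ab/cd/ef/{full_hash}.
func shardPath(root, hash string) string {
	return filepath.Join(root, hash[0:2], hash[2:4], hash[4:6], hash)
}

// storeIcon writes content under its hash and reports whether a new file was
// created. false means a dedup hit: the bytes were already on disk.
func storeIcon(root string, content []byte) (string, bool, error) {
	hash := hashHex(content)
	path := shardPath(root, hash)
	if _, err := os.Stat(path); err == nil {
		return hash, false, nil // dedup hit; the caller still records the hash
	} else if !errors.Is(err, fs.ErrNotExist) {
		return "", false, err
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return "", false, err
	}
	// Assumption: write to a temp file and rename it into place, so a
	// concurrent reader never observes a partially written icon.
	tmp, err := os.CreateTemp(filepath.Dir(path), "tmp-*")
	if err != nil {
		return "", false, err
	}
	if _, err := tmp.Write(content); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return "", false, err
	}
	if err := tmp.Close(); err != nil {
		os.Remove(tmp.Name())
		return "", false, err
	}
	if err := os.Rename(tmp.Name(), path); err != nil {
		os.Remove(tmp.Name())
		return "", false, err
	}
	return hash, true, nil
}

func main() {
	root, err := os.MkdirTemp("", "icons")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	// Storing the same bytes twice: the second call is a dedup hit.
	for i := 0; i < 2; i++ {
		hash, written, err := storeIcon(root, []byte("same favicon bytes"))
		fmt.Println(hash[:12], written, err)
	}
}
```

Three levels of two-character shards give 256^3 leaf directories, which keeps any single directory small even at tens of millions of files.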
### Bundle JSON format (`tabs/{n}.json`) ### Bundle JSON format (`tabs/{n}.json`)
@@ -133,7 +130,7 @@ All resources in **us-east-1**.
{ {
"entries": [ "entries": [
{ {
"host": "example.com", "url": "https://example.com",
"title": "Example Domain", "title": "Example Domain",
"icon": "iVBORw0KGgo...", "icon": "iVBORw0KGgo...",
"icon_w": 32, "icon_w": 32,
@@ -141,7 +138,7 @@ All resources in **us-east-1**.
"iframe_ok": true "iframe_ok": true
}, },
{ {
"host": "no-favicon-site.org", "url": "http://no-favicon-site.org",
"title": "A Site Without Favicon", "title": "A Site Without Favicon",
"icon": "", "icon": "",
"iframe_ok": false "iframe_ok": false
@@ -152,7 +149,7 @@ All resources in **us-east-1**.
Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with `"icon": ""`) as long as they have a title. CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data. Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with `"icon": ""`) as long as they have a title. CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data.
Bundle size is parameterized (`ENTRIES_PER_BUNDLE`). Target: enough entries to fill a viewport plus scroll buffer. Initial estimate ~100-150 entries (~150-300KB uncompressed, smaller after Brotli). Will be tuned empirically once the frontend is built and we can measure how many tabs fill a screen. Bundle size is parameterized (`ENTRIES_PER_BUNDLE`, default 120). Tuned to fill a viewport plus scroll buffer. Average bundle size ~215KB uncompressed, significantly smaller after Brotli.
## Pipeline Stages ## Pipeline Stages
@@ -160,7 +157,7 @@ The pipeline is a series of manually-run scripts executed in order on the single
### Stage 1: CC-Index Query ### Stage 1: CC-Index Query
**Tool:** DuckDB with httpfs extension (query CC parquet directly from S3; if >1hr, fall back to downloading parquet locally first) **Tool:** DuckDB with `aws` extension (credential chain) to read parquet directly from S3
**Input:** Common Crawl columnar index (parquet files on `s3://commoncrawl/cc-index/...`) **Input:** Common Crawl columnar index (parquet files on `s3://commoncrawl/cc-index/...`)
@@ -190,9 +187,9 @@ WHERE url_path = '/'
**Process:** **Process:**
1. Read batches of unparsed rows (cursor-based pagination by ID) 1. Read batches of unparsed rows (cursor-based pagination by ID)
2. For each row, make a byte-range GET request to Common Crawl's S3: 2. For each row, make a byte-range S3 GetObject request to the `commoncrawl` bucket:
- `Range: bytes={offset}-{offset+length-1}` - `Range: bytes={offset}-{offset+length-1}`
- Target: `https://data.commoncrawl.org/{warc_filename}` - Uses AWS SDK (not `data.commoncrawl.org` HTTPS endpoint, which rate-limits at ~100 concurrent connections)
3. Parse the WARC record to extract the HTTP response 3. Parse the WARC record to extract the HTTP response
4. From HTTP response headers: check for `X-Frame-Options` and `Content-Security-Policy` frame-ancestors 4. From HTTP response headers: check for `X-Frame-Options` and `Content-Security-Policy` frame-ancestors
5. Parse HTML defensively (lenient parser, handle malformed HTML): 5. Parse HTML defensively (lenient parser, handle malformed HTML):
@@ -220,27 +217,23 @@ WHERE url_path = '/'
**Prerequisite:** Unbound running as system resolver on the EC2 instance. **Prerequisite:** Unbound running as system resolver on the EC2 instance.
**Input:** `icons` table rows where `scan_state = 'unscanned'` and icon is worth downloading: **Input:** ALL `icons` table rows where `scan_state = 'unscanned'` — no size filter. Every `favicon_ico` and `link_rel` icon is downloaded regardless of declared size. The full archive is kept on disk; filtering happens later at best-icon selection and bundle generation.
- All `favicon_ico` entries (always attempt)
- `link_rel` entries with no declared size (unknown, could be useful)
- `link_rel` entries with declared size ≤64x64
- Skip `link_rel` entries with declared size >64x64 (192x192, 180x180, 152x152, etc. — apple-touch-icon bloat we won't use at tab scale)
**Process:** **Process:**
1. Claim batch (randomized to spread load across hosts): 1. Producer goroutine claims batches via `FOR UPDATE SKIP LOCKED`:
```sql ```sql
UPDATE icons SET scan_state = 'in_progress' UPDATE icons SET scan_state = 'in_progress'
WHERE id IN ( WHERE id IN (
SELECT id FROM icons SELECT id FROM icons
WHERE scan_state = 'unscanned' WHERE scan_state = 'unscanned'
ORDER BY md5(id::text) -- deterministic shuffle: spreads hosts apart LIMIT 5000
LIMIT N
FOR UPDATE SKIP LOCKED FOR UPDATE SKIP LOCKED
) RETURNING *; ) RETURNING id, url;
``` ```
This ensures requests to the same domain aren't back-to-back. With 30M+ icons from different hosts, a random batch of 1000 almost never contains two icons from the same server. Icons are fed into a buffered channel. N worker goroutines consume from the channel, so workers never starve between batch claims.
2. For each icon URL: 2. For each icon URL:
- Make HTTP(S) GET request (standard Go HTTP client — DNS transparently goes through Unbound) - Make HTTP(S) GET request (standard Go HTTP client — DNS transparently goes through Unbound)
- Shared `http.Transport` for connection pooling and TLS session reuse
- Enforce timeouts: 5s connect, 10s total - Enforce timeouts: 5s connect, 10s total
- Enforce max download size: 512KB (generous for icons, but prevents abuse) - Enforce max download size: 512KB (generous for icons, but prevents abuse)
- On success: - On success:
@@ -250,11 +243,11 @@ WHERE url_path = '/'
- ICO: parse ICO header, find largest embedded size ≤64x64 at a standard dimension (16/32/48/64), store THAT size in width/height - ICO: parse ICO header, find largest embedded size ≤64x64 at a standard dimension (16/32/48/64), store THAT size in width/height
- SVG: store width=NULL, height=NULL (vector, no pixel size) - SVG: store width=NULL, height=NULL (vector, no pixel size)
- Compute SHA-256 of content - Compute SHA-256 of content
- Upload to S3 `everytab-icons/{sha256}` (skip if key already exists — dedup) - Write to local disk at `{icons_dir}/ab/cd/ef/{sha256}` (skip if file already exists — dedup)
- Update icons row: s3_key, content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed' - Update icons row: s3_key (the SHA-256 hash), content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
- On failure: scan_state = 'failed', error = reason - On failure: scan_state = 'failed', error = reason
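The ICO step above can be sketched against the ICO directory layout: a 6-byte header (reserved, type, count) followed by 16-byte entries whose first two bytes are width and height, with 0 meaning 256. The names `bestICOSize` and `icoDir` are hypothetical. Considering only square entries, and returning an error when no standard size is present, are assumptions the document does not specify.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// standardSizes are the dimensions the pipeline treats as usable for tabs.
var standardSizes = map[int]bool{16: true, 32: true, 48: true, 64: true}

// bestICOSize scans the ICO directory and returns the largest square entry
// whose side is a standard dimension no larger than 64.
func bestICOSize(data []byte) (int, int, error) {
	if len(data) < 6 {
		return 0, 0, errors.New("ico: short header")
	}
	reserved := binary.LittleEndian.Uint16(data[0:2])
	kind := binary.LittleEndian.Uint16(data[2:4]) // 1 = icon, 2 = cursor
	if reserved != 0 || kind != 1 {
		return 0, 0, errors.New("ico: not an icon file")
	}
	count := int(binary.LittleEndian.Uint16(data[4:6]))
	if len(data) < 6+16*count {
		return 0, 0, errors.New("ico: truncated directory")
	}
	best := 0
	for i := 0; i < count; i++ {
		entry := data[6+16*i:]
		w, h := int(entry[0]), int(entry[1])
		if w == 0 {
			w = 256 // a zero byte encodes 256 pixels
		}
		if h == 0 {
			h = 256
		}
		if w == h && standardSizes[w] && w > best {
			best = w
		}
	}
	if best == 0 {
		return 0, 0, errors.New("ico: no standard size <= 64")
	}
	return best, best, nil
}

// icoDir builds a directory-only ICO (no image data) for demonstration.
func icoDir(sizes ...byte) []byte {
	buf := []byte{0, 0, 1, 0, byte(len(sizes)), 0}
	for _, s := range sizes {
		entry := make([]byte, 16)
		entry[0], entry[1] = s, s
		buf = append(buf, entry...)
	}
	return buf
}

func main() {
	// Entries of 16x16, 48x48 and 256x256 (encoded as 0).
	fmt.Println(bestICOSize(icoDir(16, 48, 0)))
}
```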
**Concurrency:** Goroutine pool with configurable size (start 1000, tune based on system resources). Semaphore pattern for backpressure. Monitor memory usage. **Concurrency:** Channel-based worker pool (default 200 workers, configurable). Producer goroutine feeds a buffered channel (buffer = batch size), N workers consume. No starvation between batch claims.
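A minimal sketch of the channel-based pool described above, with the database claim abstracted behind a function. `runPool`, `icon`, and `countProcessed` are hypothetical names, and the fake claimer stands in for the `FOR UPDATE SKIP LOCKED` query.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// icon is a hypothetical claimed row: just the columns the claim returns.
type icon struct {
	ID  int64
	URL string
}

// runPool starts `workers` consumers on a buffered channel and one producer
// that keeps claiming batches until claim returns an empty batch.
func runPool(claim func(limit int) []icon, process func(icon), workers, batchSize int) {
	jobs := make(chan icon, batchSize) // buffer = batch size, as in the text
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ic := range jobs {
				process(ic)
			}
		}()
	}
	// Producer: refills the channel while workers are still draining it,
	// so workers do not sit idle between batch claims.
	go func() {
		defer close(jobs)
		for {
			batch := claim(batchSize)
			if len(batch) == 0 {
				return
			}
			for _, ic := range batch {
				jobs <- ic
			}
		}
	}()
	wg.Wait()
}

// countProcessed runs the pool against an in-memory claimer and returns how
// many icons the workers handled.
func countProcessed(total, workers, batchSize int) int {
	next := 0 // touched only by the producer goroutine
	claim := func(limit int) []icon {
		var batch []icon
		for ; next < total && len(batch) < limit; next++ {
			batch = append(batch, icon{ID: int64(next), URL: fmt.Sprintf("https://example.com/%d.ico", next)})
		}
		return batch
	}
	var processed int64
	runPool(claim, func(icon) { atomic.AddInt64(&processed, 1) }, workers, batchSize)
	return int(atomic.LoadInt64(&processed))
}

func main() {
	fmt.Println(countProcessed(12, 4, 5))
}
```

The buffered channel also provides backpressure: when all workers are busy and the buffer is full, the producer blocks instead of claiming more rows than the pool can handle.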
**Fast failure strategy:** **Fast failure strategy:**
- DNS failure → fail immediately (Unbound will cache NXDOMAIN) - DNS failure → fail immediately (Unbound will cache NXDOMAIN)
@@ -263,14 +256,14 @@ WHERE url_path = '/'
- Too large → abort read at 512KB boundary - Too large → abort read at 512KB boundary
- Not an image → fail (record content-type in error) - Not an image → fail (record content-type in error)
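The timeout and size limits can be sketched with the standard library alone. This is an illustration, not the downloader's actual code: `newClient`, `readCapped`, `tryRead`, and `errTooLarge` are hypothetical names. Reading one byte past the limit is how the sketch tells "exactly 512KB" apart from "too large".

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"strings"
	"time"
)

const maxIconBytes = 512 * 1024

var errTooLarge = errors.New("too large")

// newClient builds one shared client: 5s to connect, 10s for the whole
// request. Reusing it across workers gives connection pooling.
func newClient() *http.Client {
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			DialContext: (&net.Dialer{Timeout: 5 * time.Second}).DialContext,
		},
	}
}

// readCapped reads at most limit bytes from r and fails with errTooLarge if
// the source has more. It never buffers more than limit+1 bytes.
func readCapped(r io.Reader, limit int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errTooLarge
	}
	return data, nil
}

// tryRead is a demo helper that pushes an n-byte body through readCapped.
func tryRead(n int) (int, error) {
	data, err := readCapped(strings.NewReader(strings.Repeat("x", n)), maxIconBytes)
	return len(data), err
}

func main() {
	// In real use the reader would be resp.Body from newClient().Get(url).
	_ = newClient()
	fmt.Println(tryRead(1024))
	fmt.Println(tryRead(maxIconBytes + 1))
}
```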
**Permissive on format:** Download everything — ICO, PNG, GIF, SVG, WebP, JPEG, BMP, whatever the server returns. Store the raw bytes in S3. Format filtering and conversion happens later in bundle generation. **Permissive on format:** Download everything — ICO, PNG, GIF, SVG, WebP, JPEG, BMP, whatever the server returns. Store the raw bytes on disk. Format filtering and conversion happens later in bundle generation.
**Scaling to fleet (if needed):** **Scaling to fleet (if needed):**
- Multiple EC2 instances run the same binary - Multiple EC2 instances run the same binary
- Each claims work via Postgres row-level locking (`FOR UPDATE SKIP LOCKED`) - Each claims work via Postgres row-level locking (`FOR UPDATE SKIP LOCKED`)
- No coordinator needed — linear scaling with instance count - No coordinator needed — linear scaling with instance count
**Stats emitted:** Icons attempted, completed, failed (breakdown by error type: DNS, timeout, connection refused, HTTP 4xx, HTTP 5xx, invalid image, too large), icons/sec rate, bytes downloaded, unique S3 keys (dedup hits). **Stats emitted:** Icons attempted, completed, failed (breakdown by error type: DNS, timeout, connection refused, HTTP 4xx, HTTP 5xx, invalid image, too large), icons/sec rate, bytes downloaded, dedup hits.
### Stage 4: Best Icon Selection ### Stage 4: Best Icon Selection
@@ -302,47 +295,39 @@ Uses `DISTINCT ON (host_id)` for efficient single-pass selection. See `pipeline/
**Input:** All hosts where `html_title IS NOT NULL` (include hosts without icons) **Input:** All hosts where `html_title IS NOT NULL` (include hosts without icons)
**Process:** **Process:**
1. Query all qualifying hosts from RDS (with their best_icon_s3_key) 1. Stream hosts from RDS in pages (keyset pagination on `random_order` column for shuffled output)
2. Randomize the full result set 2. For each page, concurrently convert icons (configurable concurrency, default 200):
3. For each host with an icon (best_icon_s3_key IS NOT NULL): - Read icon from local disk at `{icons_dir}/ab/cd/ef/{hash}`
- Download from S3 `everytab-icons/{s3_key}` - Decode the image via Go's `image.Decode` (handles PNG, GIF, JPEG, WebP, ICO via registered decoders)
- Decode the image based on format: - SVGs are excluded (no rasterizer) — these hosts appear without icons
- ICO: parse container, extract the image at the size recorded in width/height (the largest standard size ≤64x64). ICO can embed BMP or PNG internally — decode whichever is present. - Icons >128px downscaled to 32x32 (nearest-neighbor). Icons ≤128px kept as-is.
- PNG: decode directly - Re-encode as PNG, base64-encode
- GIF/WebP/BMP/JPEG: decode to raster 3. Converted entries accumulate in a buffer. Every 120 entries (configurable), serialize as JSON and upload to S3
- SVG: rasterize to 32x32 (use a Go SVG rasterizer library) 4. Hosts without icons: included with `"icon": ""`
- Re-encode as optimized PNG at original dimensions (never upscale — a 16x16 stays 16x16) 5. Final partial bundle written at end
- Base64-encode the PNG bytes
4. For hosts without icons: set icon to empty string
5. Chunk into groups of `ENTRIES_PER_BUNDLE` entries (parameterized, initially ~100-150, tuned to viewport fill)
6. Serialize each chunk as JSON, write to S3 `everytab-site/tabs/{n}.json`
7. Record total bundle count
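The per-icon conversion in the new process (decode, downscale if large, re-encode as PNG, base64) can be sketched as follows. This is a hedged illustration using only stdlib decoders (PNG, GIF, JPEG); WebP and ICO would need additional registered decoders that are not shown. `convertIcon`, `resizeNearest`, and `solidPNG` are hypothetical names, and treating "larger than 128px" as "either dimension exceeds 128" is an assumption.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"image"
	"image/color"
	_ "image/gif"  // registers the GIF decoder with image.Decode
	_ "image/jpeg" // registers the JPEG decoder with image.Decode
	"image/png"
)

const (
	downscaleAbove = 128 // icons larger than this are shrunk
	targetSize     = 32  // side length after downscaling
)

// resizeNearest scales src to w x h by nearest-neighbor sampling.
func resizeNearest(src image.Image, w, h int) *image.RGBA {
	dst := image.NewRGBA(image.Rect(0, 0, w, h))
	b := src.Bounds()
	for y := 0; y < h; y++ {
		sy := b.Min.Y + y*b.Dy()/h
		for x := 0; x < w; x++ {
			sx := b.Min.X + x*b.Dx()/w
			dst.Set(x, y, src.At(sx, sy))
		}
	}
	return dst
}

// convertIcon decodes raw bytes, downscales large icons, and returns the
// base64 PNG plus its final dimensions. Undecodable input (for example SVG)
// yields an error, so the host would be bundled without an icon.
func convertIcon(raw []byte) (string, int, int, error) {
	img, _, err := image.Decode(bytes.NewReader(raw))
	if err != nil {
		return "", 0, 0, err
	}
	if b := img.Bounds(); b.Dx() > downscaleAbove || b.Dy() > downscaleAbove {
		img = resizeNearest(img, targetSize, targetSize)
	}
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		return "", 0, 0, err
	}
	out := img.Bounds()
	return base64.StdEncoding.EncodeToString(buf.Bytes()), out.Dx(), out.Dy(), nil
}

// solidPNG builds a single-color square PNG to use as demo input.
func solidPNG(size int) []byte {
	img := image.NewRGBA(image.Rect(0, 0, size, size))
	for y := 0; y < size; y++ {
		for x := 0; x < size; x++ {
			img.Set(x, y, color.RGBA{R: 200, G: 30, B: 30, A: 255})
		}
	}
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	for _, size := range []int{16, 256} {
		b64, w, h, err := convertIcon(solidPNG(size))
		if err != nil {
			panic(err)
		}
		fmt.Println(size, "->", w, h, b64[:11])
	}
}
```

The returned width and height correspond to the `icon_w` and `icon_h` fields in the bundle JSON, so the frontend can size the `<img>` without decoding the image first.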
**Output:** **Output:**
- `tabs/0.json` through `tabs/{M}.json` in S3 `everytab-site` - `tabs/0000.json` through `tabs/{M-1}.json` in S3 `everytab-site`
- Total bundle count M - Total bundle count M (bake into frontend via deploy script)
- `stats.json` in S3 `everytab-site` (pipeline statistics)
**Stats emitted:** Total bundles created, total hosts included (with icon / without icon), average bundle size (bytes), total S3 storage used, icon conversion failures. **Stats emitted:** Total bundles created, total hosts included (with icon / without icon), average bundle size (bytes), total S3 storage used, icon conversion failures.
### Stage 6: Frontend Build ### Stage 6: Frontend Deploy
**Tool:** Simple script or template engine **Tool:** `pipeline/06_frontend/deploy.sh`
**Process:** **Process:**
1. Inject `const TOTAL_BUNDLES = {M};` into the JS 1. `sed` injects `const TOTAL_BUNDLES = {M};` into a temp copy of `index.html`
2. Write `index.html` and `site.js` to S3 `everytab-site` 2. Uploads `index.html`, `site.js`, `bot.html`, `about.html` to S3 `everytab-site`
3. Invalidate CloudFront distribution (`/*`) 3. Invalidates CloudFront cache for all four files (auto-detects distribution ID)
### Stage 7: Backup & Teardown ### Stage 7: Backup & Teardown
**Process (manual, with confirmation at each step):** **Process (manual, with confirmation at each step):**
1. Dump RDS database: `pg_dump` → transfer to homelab 1. Dump RDS database: `pg_dump -Fc` → transfer to homelab via rsync
2. Sync icons: `aws s3 sync s3://everytab-icons/ homelab:/path/to/backup/icons/` 2. Sync icons from local disk: `rsync -avP ~/icons/ homelab:/backups/everytab/icons/`
3. **Verify backups:** confirm pg_dump restores cleanly on homelab, spot-check icon files 3. **Verify backups:** confirm pg_dump restores cleanly on homelab, spot-check icon files
4. Delete RDS instance (skip final snapshot — homelab backup is the source of truth, snapshots cost $0.095/GB-month) 4. Tear down scanning infra: `terraform apply -var="scanning=false"` (deletes RDS, EC2, icons S3 bucket)
5. Delete S3 `everytab-icons` bucket
6. Terminate EC2 instance
## DNS Architecture ## DNS Architecture
@@ -376,18 +361,19 @@ Subsequent scrolls: one additional `/tabs/{n}.json` per "page" of tabs.
### Tab Rendering ### Tab Rendering
- Rows of tabs fill the viewport, styled to mimic Firefox browser tabs (v1) - Rows of tabs fill the viewport, styled to match the visitor's browser (Chrome, Firefox, Safari — detected via `navigator.userAgent`)
- Each row has a subtle horizontal marquee animation (CSS `@keyframes` / `animation`) at slightly varying speeds - Each row has a bidirectional marquee animation at varying speeds (90-150s per cycle), with random stagger to avoid synchronization
- Tab density adapts to viewport width (responsive) - Tabs duplicated in DOM for seamless marquee loop (`translateX(-50%)`)
- Each tab shows: favicon (rendered via `<img src="data:image/png;base64,...">`) + truncated title - Each tab shows: favicon (rendered via `<img src="data:image/png;base64,...">`) + truncated title
- No-icon tabs: just title text, no icon (Firefox behavior) - No-icon tabs: just title text, no icon
- Enough tabs rendered to fill viewport + buffer below fold (so user can scroll immediately without waiting for next fetch) - Light mode default, auto-switches to dark mode via `prefers-color-scheme`
- Hover shows full title as native tooltip
### Interaction ### Interaction
- **Click tab (iframe_ok=true):** Opens an iframe overlay showing the actual site - **Click tab (iframe_ok=true):** Opens an inline iframe viewer between tab rows (75vh height, pushes content down)
- **Click tab (iframe_ok=false):** Opens site in a new tab (with subtle external-link indicator on the tab) - **Click tab (iframe_ok=false):** Opens site in a new tab (with `↗` external-link indicator on the tab)
- **Close overlay:** X button or click outside dismisses iframe - **Close viewer:** X button or Escape key. Only one viewer open at a time.
- **Scroll down:** When approaching the bottom, fetch next random bundle and render more rows - **Scroll down:** When approaching the bottom, fetch next random bundle and render more rows
### Randomization ### Randomization
@@ -397,11 +383,11 @@ Subsequent scrolls: one additional `/tabs/{n}.json` per "page" of tabs.
- Generate random bundle indices in range `[0, TOTAL_BUNDLES)` - Generate random bundle indices in range `[0, TOTAL_BUNDLES)`
- Track fetched bundle IDs in a `Set` to avoid loading duplicates on continued scroll - Track fetched bundle IDs in a `Set` to avoid loading duplicates on continued scroll
### Future Enhancements (v2+) ### Future Enhancements
- Browser-specific tab styles (Chrome tabs for Chrome users, Safari for Safari, etc.)
- Mobile-optimized layout - Mobile-optimized layout
- "Search for a site" feature - "Search for a site" feature
- Stats page (how many sites, coverage, etc.) - Stats page (how many sites, coverage, etc.)
- Performance: IntersectionObserver to pause off-screen marquee rows
## Statistics & Metadata ## Statistics & Metadata
@@ -486,14 +472,13 @@ This is served publicly at `/stats.json` on the live site — interesting metada
| Item | Estimate | | Item | Estimate |
|------|----------| |------|----------|
| EC2 c5.xlarge (~24-48hrs) | $8-16 | | EC2 c5.xlarge (~3-4 days) | $12-16 |
| RDS db.t3.medium (~48-72hrs including dev time) | $3-7 | | EBS 1TB gp3 (~4 days) | $10 |
| S3 everytab-icons storage (~500GB, prorated to days) | $1-3 | | RDS db.t3.medium (~4 days) | $4-6 |
| S3 PUT requests (icon uploads, ~30M) | $15 |
| Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) | | Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) |
| Data transfer (icon downloads from internet, inbound) | $0 (inbound free) | | Data transfer (icon downloads from internet, inbound) | $0 (inbound free) |
| Data transfer (backup to homelab, outbound) | $5-10 | | Data transfer (backup to homelab, outbound) | $5-45 (depends on icon archive size) |
| **Total** | **~$32-51** | | **Total** | **~$31-77** |
### Hosting Phase (Monthly Steady-State) ### Hosting Phase (Monthly Steady-State)
@@ -537,11 +522,11 @@ If the site gets significant traffic beyond CloudFront free tier, costs scale wi
2. **Inline icons in bundles** — One fetch gives you 100+ tabs to render. No per-icon requests. 2. **Inline icons in bundles** — One fetch gives you 100+ tabs to render. No per-icon requests.
3. **Base64 + Brotli** — Base64 for browser-native decoding (`atob()`). Brotli compression at the CDN layer reduces transfer size by ~25-30% for free. 3. **Base64 + Brotli** — Base64 for browser-native decoding (`atob()`). Brotli compression at the CDN layer reduces transfer size by ~25-30% for free.
4. **Unbound as system resolver** — Transparent to application code. Standard Go HTTP. No custom networking. 4. **Unbound as system resolver** — Transparent to application code. Standard Go HTTP. No custom networking.
5. **SHA-256 content-addressed icon storage** — Natural dedup at S3 layer. Same favicon stored once even if referenced by multiple hosts. 5. **SHA-256 content-addressed icon storage** — Natural dedup on local disk. Same favicon stored once even if referenced by multiple hosts.
6. **Permissive download, selective bundling** — Download ALL favicon formats during scanning. Convert to optimized PNG only during bundle generation. Decouples "capture as much as possible" from "serve the best version." 6. **Permissive download, selective bundling** — Download ALL favicon formats and sizes during scanning. Convert to optimized PNG only during bundle generation. Decouples "capture as much as possible" from "serve the best version."
7. **Partial index for work claiming** — Indexes only unscanned rows. Shrinks as work progresses. Minimal write amplification. 7. **Partial index for work claiming** — Indexes only unscanned rows. Shrinks as work progresses. Minimal write amplification.
8. **Two S3 buckets** — Clean separation of concerns. Private working storage vs public site. Safe deletion of temporary data. 8. **Local disk for icons, S3 for site** — Icons stored on EBS during scanning (avoids ~$175 in S3 PUT costs at 30M scale). Only the static site lives in S3 behind CloudFront.
9. **Per-millisecond random seed** — Every visitor sees a unique arrangement. No shared state, no server needed for randomization. 9. **Per-millisecond random seed** — Every visitor sees a unique arrangement. No shared state, no server needed for randomization.
10. **Viewport-sized bundles** — ~100-150 tabs per bundle, tuned to fill a screen. Faster loads, smaller memory footprint than 1MB bundles. 10. **Viewport-sized bundles** — ~100-150 tabs per bundle, tuned to fill a screen. Faster loads, smaller memory footprint than 1MB bundles.
11. **Include no-icon hosts** — Any host with a title is included. Firefox-style rendering (title only) for hosts without favicons. 11. **Include no-icon hosts** — Any host with a title is included. Firefox-style rendering (title only) for hosts without favicons.
12. **Denormalized best_icon_s3_key in hosts** — Avoids joins during bundle generation. Written once during icon selection, read once during bundling. 12. **Denormalized best_icon_s3_key in hosts**: Stores the SHA-256 hash of the chosen icon. Avoids joins during bundle generation. Written once during icon selection, read once during bundling.