updated ARCHITECTURE.md
parent
2f1547a912
commit
258c6c5f3a
1 changed file with 72 additions and 87 deletions
159
ARCHITECTURE.md
|
|
@ -19,11 +19,14 @@ flowchart TD
|
||||||
C["Stage 3: Download Icons - Go"]
|
C["Stage 3: Download Icons - Go"]
|
||||||
D["Stage 4: Select Best Icons"]
|
D["Stage 4: Select Best Icons"]
|
||||||
E["Stage 5: Generate Bundles - Go"]
|
E["Stage 5: Generate Bundles - Go"]
|
||||||
F["Stage 6: Build Frontend"]
|
F["Stage 6: Deploy Frontend"]
|
||||||
UB["Unbound - Local recursive resolver"]
|
UB["Unbound - Local recursive resolver"]
|
||||||
|
DISK["Local disk - Sharded icon archive"]
|
||||||
|
|
||||||
A --> B --> C --> D --> E --> F
|
A --> B --> C --> D --> E --> F
|
||||||
UB -.-> C
|
UB -.-> C
|
||||||
|
C --> DISK
|
||||||
|
DISK --> E
|
||||||
end
|
end
|
||||||
|
|
||||||
subgraph ExtData["External Data"]
|
subgraph ExtData["External Data"]
|
||||||
|
|
@ -32,22 +35,19 @@ flowchart TD
|
||||||
|
|
||||||
subgraph AWS["AWS Services"]
|
subgraph AWS["AWS Services"]
|
||||||
RDS[("RDS Postgres - hosts + icons tables")]
|
RDS[("RDS Postgres - hosts + icons tables")]
|
||||||
S3I["S3: everytab-icons - Raw downloaded favicons"]
|
|
||||||
S3S["S3: everytab-site - tabs/*.json + index.html"]
|
S3S["S3: everytab-site - tabs/*.json + index.html"]
|
||||||
CF["CloudFront CDN"]
|
CF["CloudFront CDN"]
|
||||||
end
|
end
|
||||||
|
|
||||||
subgraph Post["Post-Scan"]
|
subgraph Post["Post-Scan"]
|
||||||
BAK["Backup to Homelab - RDS dump + icons sync"]
|
BAK["Backup to Homelab - RDS dump + icons rsync"]
|
||||||
TEAR["Teardown - Delete RDS, icons bucket, EC2"]
|
TEAR["Teardown - Delete RDS, EC2"]
|
||||||
end
|
end
|
||||||
|
|
||||||
CC --> A
|
CC --> A
|
||||||
CC --> B
|
CC --> B
|
||||||
A --> RDS
|
A --> RDS
|
||||||
B --> RDS
|
B --> RDS
|
||||||
B --> S3I
|
|
||||||
C --> S3I
|
|
||||||
C --> RDS
|
C --> RDS
|
||||||
D --> RDS
|
D --> RDS
|
||||||
E --> S3S
|
E --> S3S
|
||||||
|
|
@ -66,23 +66,19 @@ All resources in **us-east-1**.
|
||||||
|
|
||||||
| Resource | Purpose | Lifecycle |
|
| Resource | Purpose | Lifecycle |
|
||||||
|----------|---------|-----------|
|
|----------|---------|-----------|
|
||||||
| EC2 (c5.xlarge) | Run all pipeline stages | Scanning only |
|
| EC2 (c5.xlarge) + 1TB EBS | Run all pipeline stages, store icon archive | Scanning only |
|
||||||
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup to homelab, then delete) |
|
| RDS Postgres (db.t3.medium) | Store hosts/icons metadata | Scanning only (backup to homelab, then delete) |
|
||||||
| S3 `everytab-icons` | Raw downloaded favicons | Scanning only (backup to homelab, then delete) |
|
|
||||||
| S3 `everytab-site` | Static site: index.html, site.js, tabs/*.json | Permanent |
|
| S3 `everytab-site` | Static site: index.html, site.js, tabs/*.json | Permanent |
|
||||||
| CloudFront | CDN for static site (Brotli compression enabled) | Permanent |
|
| CloudFront | CDN for static site (Brotli compression enabled) | Permanent |
|
||||||
|
| S3 `everytab-logs` | CloudFront access logs | Permanent |
|
||||||
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only (runs on EC2) |
|
| Unbound (on EC2) | Local recursive DNS resolver | Scanning only (runs on EC2) |
|
||||||
|
|
||||||
### Why Two S3 Buckets
|
### Icon Storage
|
||||||
|
|
||||||
- `everytab-site` is configured as a CloudFront origin with public read access (via OAC). The entire bucket IS the website.
|
Icons are stored on local disk during scanning, not in S3. The EBS volume holds the full icon archive in a sharded directory structure (`ab/cd/ef/{sha256}`), where the three directory levels are the first three byte pairs of the hash. This avoids ~$175 in S3 PUT costs at the 30M-icon scale. After scanning completes, icons are backed up to the homelab via rsync.
|
||||||
- `everytab-icons` is completely private — only the EC2 instance reads/writes to it. No public access configuration needed.
|
|
||||||
- Backup is clean: `aws s3 sync s3://everytab-icons/ /homelab/path/` grabs the whole bucket.
|
|
||||||
- Deletion is clean: `aws s3 rb s3://everytab-icons --force` — zero risk of nuking the live site.
|
|
||||||
- One bucket with prefix-based policies works but is fiddlier (CloudFront must serve `tabs/` and `index.html` but NOT `icons/`). Two buckets eliminates that surface area for misconfiguration.
|
|
||||||
|
|
||||||
### Steady-State (Hosting Only)
|
### Steady-State (Hosting Only)
|
||||||
- S3 `everytab-site` — index.html + site.js + ~50K JSON bundles
|
- S3 `everytab-site` — index.html + site.js + ~250K JSON bundles
|
||||||
- CloudFront distribution — Brotli-compressed delivery, caching
|
- CloudFront distribution — Brotli-compressed delivery, caching
|
||||||
|
|
||||||
## Data Model
|
## Data Model
|
||||||
|
|
@ -100,8 +96,9 @@ All resources in **us-east-1**.
|
||||||
| warc_record_length | INT NOT NULL | Length of WARC record |
|
| warc_record_length | INT NOT NULL | Length of WARC record |
|
||||||
| html_title | TEXT | Extracted from `<title>` tag |
|
| html_title | TEXT | Extracted from `<title>` tag |
|
||||||
| iframe_allowed | BOOLEAN | True if site allows framing |
|
| iframe_allowed | BOOLEAN | True if site allows framing |
|
||||||
| best_icon_s3_key | TEXT | S3 key of the chosen icon (denormalized for fast bundle gen) |
|
| best_icon_s3_key | TEXT | SHA-256 hash of the chosen icon file (denormalized for fast bundle gen) |
|
||||||
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |
|
| parsed | BOOLEAN DEFAULT FALSE | Whether WARC has been parsed |
|
||||||
|
| random_order | DOUBLE PRECISION DEFAULT random() | Random value for shuffled bundle generation pagination |
|
||||||
|
|
||||||
### `icons` table
|
### `icons` table
|
||||||
|
|
||||||
|
|
@ -117,7 +114,7 @@ All resources in **us-east-1**.
|
||||||
| width | INT | Best usable pixel width (for ICO: largest standard size ≤64; for SVG: NULL) |
|
| width | INT | Best usable pixel width (for ICO: largest standard size ≤64; for SVG: NULL) |
|
||||||
| height | INT | Best usable pixel height (for ICO: largest standard size ≤64; for SVG: NULL) |
|
| height | INT | Best usable pixel height (for ICO: largest standard size ≤64; for SVG: NULL) |
|
||||||
| file_size | INT | Size in bytes |
|
| file_size | INT | Size in bytes |
|
||||||
| s3_key | TEXT | Key in everytab-icons bucket (SHA-256 of content) |
|
| s3_key | TEXT | SHA-256 hash of content, used to derive the local file path (the column name is a leftover from the S3-based design) |
|
||||||
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
|
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
|
||||||
| error | TEXT | Error message if failed |
|
| error | TEXT | Error message if failed |
|
||||||
|
|
||||||
|
|
@ -125,7 +122,7 @@ All resources in **us-east-1**.
|
||||||
- `CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned'` — partial index for work claiming. Only indexes unscanned rows; shrinks as work completes. Minimal write overhead since index only updates on transition OUT of 'unscanned'.
|
- `CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned'` — partial index for work claiming. Only indexes unscanned rows; shrinks as work completes. Minimal write overhead since index only updates on transition OUT of 'unscanned'.
|
||||||
- `idx_icons_host_id` on (host_id) — for best-icon selection query
|
- `idx_icons_host_id` on (host_id) — for best-icon selection query
|
||||||
|
|
||||||
**S3 Key Strategy:** SHA-256 hash of the downloaded icon content. This gives free dedup at the storage layer — if two sites serve the exact same favicon bytes, we store it once. The hash is computed client-side (by the Go downloader) and used as the key. Before uploading, check if the key exists; if so, skip the upload but still record the s3_key in the icons table.
|
**Content-Addressed Storage:** SHA-256 hash of the downloaded icon content, used as the local file path (`ab/cd/ef/{full_hash}`). This gives free dedup — if two sites serve the exact same favicon bytes, we store it once. Before writing, check if the file exists; if so, skip the write but still record the hash in the icons table.
|
||||||
|
|
||||||
### Bundle JSON format (`tabs/{n}.json`)
|
### Bundle JSON format (`tabs/{n}.json`)
|
||||||
|
|
||||||
|
|
@ -133,7 +130,7 @@ All resources in **us-east-1**.
|
||||||
{
|
{
|
||||||
"entries": [
|
"entries": [
|
||||||
{
|
{
|
||||||
"host": "example.com",
|
"url": "https://example.com",
|
||||||
"title": "Example Domain",
|
"title": "Example Domain",
|
||||||
"icon": "iVBORw0KGgo...",
|
"icon": "iVBORw0KGgo...",
|
||||||
"icon_w": 32,
|
"icon_w": 32,
|
||||||
|
|
@ -141,7 +138,7 @@ All resources in **us-east-1**.
|
||||||
"iframe_ok": true
|
"iframe_ok": true
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"host": "no-favicon-site.org",
|
"url": "http://no-favicon-site.org",
|
||||||
"title": "A Site Without Favicon",
|
"title": "A Site Without Favicon",
|
||||||
"icon": "",
|
"icon": "",
|
||||||
"iframe_ok": false
|
"iframe_ok": false
|
||||||
|
|
@ -152,7 +149,7 @@ All resources in **us-east-1**.
|
||||||
|
|
||||||
Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with `"icon": ""`) as long as they have a title. CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data.
|
Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with `"icon": ""`) as long as they have a title. CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data.
|
||||||
|
|
||||||
Bundle size is parameterized (`ENTRIES_PER_BUNDLE`). Target: enough entries to fill a viewport plus scroll buffer. Initial estimate ~100-150 entries (~150-300KB uncompressed, smaller after Brotli). Will be tuned empirically once the frontend is built and we can measure how many tabs fill a screen.
|
Bundle size is parameterized (`ENTRIES_PER_BUNDLE`, default 120) and tuned to fill a viewport plus scroll buffer. The average bundle is ~215KB uncompressed and significantly smaller after Brotli.
|
||||||
|
|
||||||
## Pipeline Stages
|
## Pipeline Stages
|
||||||
|
|
||||||
|
|
@ -160,7 +157,7 @@ The pipeline is a series of manually-run scripts executed in order on the single
|
||||||
|
|
||||||
### Stage 1: CC-Index Query
|
### Stage 1: CC-Index Query
|
||||||
|
|
||||||
**Tool:** DuckDB with httpfs extension (query CC parquet directly from S3; if >1hr, fall back to downloading parquet locally first)
|
**Tool:** DuckDB with `aws` extension (credential chain) to read parquet directly from S3
|
||||||
|
|
||||||
**Input:** Common Crawl columnar index (parquet files on `s3://commoncrawl/cc-index/...`)
|
**Input:** Common Crawl columnar index (parquet files on `s3://commoncrawl/cc-index/...`)
|
||||||
|
|
||||||
|
|
@ -190,9 +187,9 @@ WHERE url_path = '/'
|
||||||
|
|
||||||
**Process:**
|
**Process:**
|
||||||
1. Read batches of unparsed rows (cursor-based pagination by ID)
|
1. Read batches of unparsed rows (cursor-based pagination by ID)
|
||||||
2. For each row, make a byte-range GET request to Common Crawl's S3:
|
2. For each row, make a byte-range S3 GetObject request to the `commoncrawl` bucket:
|
||||||
- `Range: bytes={offset}-{offset+length-1}`
|
- `Range: bytes={offset}-{offset+length-1}`
|
||||||
- Target: `https://data.commoncrawl.org/{warc_filename}`
|
- Uses AWS SDK (not `data.commoncrawl.org` HTTPS endpoint, which rate-limits at ~100 concurrent connections)
|
||||||
3. Parse the WARC record to extract the HTTP response
|
3. Parse the WARC record to extract the HTTP response
|
||||||
4. From HTTP response headers: check for `X-Frame-Options` and `Content-Security-Policy` frame-ancestors
|
4. From HTTP response headers: check for `X-Frame-Options` and `Content-Security-Policy` frame-ancestors
|
||||||
5. Parse HTML defensively (lenient parser, handle malformed HTML):
|
5. Parse HTML defensively (lenient parser, handle malformed HTML):
|
||||||
|
|
@ -220,27 +217,23 @@ WHERE url_path = '/'
|
||||||
|
|
||||||
**Prerequisite:** Unbound running as system resolver on the EC2 instance.
|
**Prerequisite:** Unbound running as system resolver on the EC2 instance.
|
||||||
|
|
||||||
**Input:** `icons` table rows where `scan_state = 'unscanned'` and icon is worth downloading:
|
**Input:** ALL `icons` table rows where `scan_state = 'unscanned'` — no size filter. Every `favicon_ico` and `link_rel` icon is downloaded regardless of declared size. The full archive is kept on disk; filtering happens later at best-icon selection and bundle generation.
|
||||||
- All `favicon_ico` entries (always attempt)
|
|
||||||
- `link_rel` entries with no declared size (unknown, could be useful)
|
|
||||||
- `link_rel` entries with declared size ≤64x64
|
|
||||||
- Skip `link_rel` entries with declared size >64x64 (192x192, 180x180, 152x152, etc. — apple-touch-icon bloat we won't use at tab scale)
|
|
||||||
|
|
||||||
**Process:**
|
**Process:**
|
||||||
1. Claim batch (randomized to spread load across hosts):
|
1. Producer goroutine claims batches via `FOR UPDATE SKIP LOCKED`:
|
||||||
```sql
|
```sql
|
||||||
UPDATE icons SET scan_state = 'in_progress'
|
UPDATE icons SET scan_state = 'in_progress'
|
||||||
WHERE id IN (
|
WHERE id IN (
|
||||||
SELECT id FROM icons
|
SELECT id FROM icons
|
||||||
WHERE scan_state = 'unscanned'
|
WHERE scan_state = 'unscanned'
|
||||||
ORDER BY md5(id::text) -- deterministic shuffle: spreads hosts apart
|
LIMIT 5000
|
||||||
LIMIT N
|
|
||||||
FOR UPDATE SKIP LOCKED
|
FOR UPDATE SKIP LOCKED
|
||||||
) RETURNING *;
|
) RETURNING id, url;
|
||||||
```
|
```
|
||||||
This ensures requests to the same domain aren't back-to-back. With 30M+ icons from different hosts, a random batch of 1000 almost never contains two icons from the same server.
|
Icons are fed into a buffered channel. N worker goroutines consume from the channel, so workers never starve between batch claims.
|
||||||
2. For each icon URL:
|
2. For each icon URL:
|
||||||
- Make HTTP(S) GET request (standard Go HTTP client — DNS transparently goes through Unbound)
|
- Make HTTP(S) GET request (standard Go HTTP client — DNS transparently goes through Unbound)
|
||||||
|
- Shared `http.Transport` for connection pooling and TLS session reuse
|
||||||
- Enforce timeouts: 5s connect, 10s total
|
- Enforce timeouts: 5s connect, 10s total
|
||||||
- Enforce max download size: 512KB (generous for icons, but prevents abuse)
|
- Enforce max download size: 512KB (generous for icons, but prevents abuse)
|
||||||
- On success:
|
- On success:
|
||||||
|
|
@ -250,11 +243,11 @@ WHERE url_path = '/'
|
||||||
- ICO: parse ICO header, find largest embedded size ≤64x64 at a standard dimension (16/32/48/64), store THAT size in width/height
|
- ICO: parse ICO header, find largest embedded size ≤64x64 at a standard dimension (16/32/48/64), store THAT size in width/height
|
||||||
- SVG: store width=NULL, height=NULL (vector, no pixel size)
|
- SVG: store width=NULL, height=NULL (vector, no pixel size)
|
||||||
- Compute SHA-256 of content
|
- Compute SHA-256 of content
|
||||||
- Upload to S3 `everytab-icons/{sha256}` (skip if key already exists — dedup)
|
- Write to local disk at `{icons_dir}/ab/cd/ef/{sha256}` (skip if file already exists — dedup)
|
||||||
- Update icons row: s3_key, content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
|
- Update icons row: s3_key (the SHA-256 hash), content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
|
||||||
- On failure: scan_state = 'failed', error = reason
|
- On failure: scan_state = 'failed', error = reason
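The ICO size rule above (largest embedded size ≤64x64 at a standard dimension) can be sketched in Go by reading only the ICO directory. This is an illustration, not the pipeline's code: `bestICOSize` is a hypothetical name, and skipping non-square entries is an assumption the document does not state.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// bestICOSize scans an ICO directory and returns the largest standard
// dimension (16/32/48/64) present, or ok=false if none qualifies.
func bestICOSize(data []byte) (size int, ok bool) {
	// ICONDIR header: reserved (must be 0), type (1 = icon), image count.
	if len(data) < 6 ||
		binary.LittleEndian.Uint16(data[0:2]) != 0 ||
		binary.LittleEndian.Uint16(data[2:4]) != 1 {
		return 0, false
	}
	count := int(binary.LittleEndian.Uint16(data[4:6]))
	for i := 0; i < count; i++ {
		off := 6 + 16*i // each ICONDIRENTRY is 16 bytes
		if off+16 > len(data) {
			break // truncated directory: stop at the last complete entry
		}
		// Width and height are single bytes; a value of 0 means 256.
		w, h := int(data[off]), int(data[off+1])
		if w != h {
			continue // assumption: only square images are considered
		}
		switch w {
		case 16, 32, 48, 64:
			if w > size {
				size = w
			}
		}
	}
	return size, size != 0
}

func main() {
	// Synthetic directory with 16, 32, 128 and 256 (encoded as 0) pixel entries.
	ico := []byte{0, 0, 1, 0, 4, 0}
	for _, d := range []byte{16, 32, 128, 0} {
		e := make([]byte, 16)
		e[0], e[1] = d, d
		ico = append(ico, e...)
	}
	size, ok := bestICOSize(ico)
	fmt.Println(size, ok)
}
```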
|
||||||
|
|
||||||
**Concurrency:** Goroutine pool with configurable size (start 1000, tune based on system resources). Semaphore pattern for backpressure. Monitor memory usage.
|
**Concurrency:** Channel-based worker pool (default 200 workers, configurable). Producer goroutine feeds a buffered channel (buffer = batch size), N workers consume. No starvation between batch claims.
|
||||||
|
|
||||||
**Fast failure strategy:**
|
**Fast failure strategy:**
|
||||||
- DNS failure → fail immediately (Unbound will cache NXDOMAIN)
|
- DNS failure → fail immediately (Unbound will cache NXDOMAIN)
|
||||||
|
|
@ -263,14 +256,14 @@ WHERE url_path = '/'
|
||||||
- Too large → abort read at 512KB boundary
|
- Too large → abort read at 512KB boundary
|
||||||
- Not an image → fail (record content-type in error)
|
- Not an image → fail (record content-type in error)
|
||||||
|
|
||||||
**Permissive on format:** Download everything — ICO, PNG, GIF, SVG, WebP, JPEG, BMP, whatever the server returns. Store the raw bytes in S3. Format filtering and conversion happens later in bundle generation.
|
**Permissive on format:** Download everything — ICO, PNG, GIF, SVG, WebP, JPEG, BMP, whatever the server returns. Store the raw bytes on disk. Format filtering and conversion happens later in bundle generation.
|
||||||
|
|
||||||
**Scaling to fleet (if needed):**
|
**Scaling to fleet (if needed):**
|
||||||
- Multiple EC2 instances run the same binary
|
- Multiple EC2 instances run the same binary
|
||||||
- Each claims work via Postgres row-level locking (`FOR UPDATE SKIP LOCKED`)
|
- Each claims work via Postgres row-level locking (`FOR UPDATE SKIP LOCKED`)
|
||||||
- No coordinator needed — linear scaling with instance count
|
- No coordinator needed — linear scaling with instance count
|
||||||
|
|
||||||
**Stats emitted:** Icons attempted, completed, failed (breakdown by error type: DNS, timeout, connection refused, HTTP 4xx, HTTP 5xx, invalid image, too large), icons/sec rate, bytes downloaded, unique S3 keys (dedup hits).
|
**Stats emitted:** Icons attempted, completed, failed (breakdown by error type: DNS, timeout, connection refused, HTTP 4xx, HTTP 5xx, invalid image, too large), icons/sec rate, bytes downloaded, dedup hits.
|
||||||
|
|
||||||
### Stage 4: Best Icon Selection
|
### Stage 4: Best Icon Selection
|
||||||
|
|
||||||
|
|
@ -302,47 +295,39 @@ Uses `DISTINCT ON (host_id)` for efficient single-pass selection. See `pipeline/
|
||||||
**Input:** All hosts where `html_title IS NOT NULL` (include hosts without icons)
|
**Input:** All hosts where `html_title IS NOT NULL` (include hosts without icons)
|
||||||
|
|
||||||
**Process:**
|
**Process:**
|
||||||
1. Query all qualifying hosts from RDS (with their best_icon_s3_key)
|
1. Stream hosts from RDS in pages (keyset pagination on `random_order` column for shuffled output)
|
||||||
2. Randomize the full result set
|
2. For each page, concurrently convert icons (configurable concurrency, default 200):
|
||||||
3. For each host with an icon (best_icon_s3_key IS NOT NULL):
|
- Read icon from local disk at `{icons_dir}/ab/cd/ef/{hash}`
|
||||||
- Download from S3 `everytab-icons/{s3_key}`
|
- Decode the image via Go's `image.Decode` (handles PNG, GIF, JPEG, WebP, ICO via registered decoders)
|
||||||
- Decode the image based on format:
|
- SVGs are excluded (no rasterizer) — these hosts appear without icons
|
||||||
- ICO: parse container, extract the image at the size recorded in width/height (the largest standard size ≤64x64). ICO can embed BMP or PNG internally — decode whichever is present.
|
- Icons >128px downscaled to 32x32 (nearest-neighbor). Icons ≤128px kept as-is.
|
||||||
- PNG: decode directly
|
- Re-encode as PNG, base64-encode
|
||||||
- GIF/WebP/BMP/JPEG: decode to raster
|
3. Converted entries accumulate in a buffer. Every 120 entries (configurable), serialize as JSON and upload to S3
|
||||||
- SVG: rasterize to 32x32 (use a Go SVG rasterizer library)
|
4. Hosts without icons: included with `"icon": ""`
|
||||||
- Re-encode as optimized PNG at original dimensions (never upscale — a 16x16 stays 16x16)
|
5. Final partial bundle written at end
|
||||||
- Base64-encode the PNG bytes
|
|
||||||
4. For hosts without icons: set icon to empty string
|
|
||||||
5. Chunk into groups of `ENTRIES_PER_BUNDLE` entries (parameterized, initially ~100-150, tuned to viewport fill)
|
|
||||||
6. Serialize each chunk as JSON, write to S3 `everytab-site/tabs/{n}.json`
|
|
||||||
7. Record total bundle count
|
|
||||||
|
|
||||||
**Output:**
|
**Output:**
|
||||||
- `tabs/0.json` through `tabs/{M}.json` in S3 `everytab-site`
|
- `tabs/0000.json` through `tabs/{M}.json` in S3 `everytab-site`
|
||||||
- Total bundle count M
|
- Total bundle count M (bake into frontend via deploy script)
|
||||||
- `stats.json` in S3 `everytab-site` (pipeline statistics)
|
|
||||||
|
|
||||||
**Stats emitted:** Total bundles created, total hosts included (with icon / without icon), average bundle size (bytes), total S3 storage used, icon conversion failures.
|
**Stats emitted:** Total bundles created, total hosts included (with icon / without icon), average bundle size (bytes), total S3 storage used, icon conversion failures.
|
||||||
|
|
||||||
### Stage 6: Frontend Build
|
### Stage 6: Frontend Deploy
|
||||||
|
|
||||||
**Tool:** Simple script or template engine
|
**Tool:** `pipeline/06_frontend/deploy.sh`
|
||||||
|
|
||||||
**Process:**
|
**Process:**
|
||||||
1. Inject `const TOTAL_BUNDLES = {M};` into the JS
|
1. `sed` injects `const TOTAL_BUNDLES = {M};` into a temp copy of `index.html`
|
||||||
2. Write `index.html` and `site.js` to S3 `everytab-site`
|
2. Uploads `index.html`, `site.js`, `bot.html`, `about.html` to S3 `everytab-site`
|
||||||
3. Invalidate CloudFront distribution (`/*`)
|
3. Invalidates CloudFront cache for all four files (auto-detects distribution ID)
|
||||||
|
|
||||||
### Stage 7: Backup & Teardown
|
### Stage 7: Backup & Teardown
|
||||||
|
|
||||||
**Process (manual, with confirmation at each step):**
|
**Process (manual, with confirmation at each step):**
|
||||||
1. Dump RDS database: `pg_dump` → transfer to homelab
|
1. Dump RDS database: `pg_dump -Fc` → transfer to homelab via rsync
|
||||||
2. Sync icons: `aws s3 sync s3://everytab-icons/ homelab:/path/to/backup/icons/`
|
2. Sync icons from local disk: `rsync -avP ~/icons/ homelab:/backups/everytab/icons/`
|
||||||
3. **Verify backups:** confirm pg_dump restores cleanly on homelab, spot-check icon files
|
3. **Verify backups:** confirm pg_dump restores cleanly on homelab, spot-check icon files
|
||||||
4. Delete RDS instance (skip final snapshot — homelab backup is the source of truth, snapshots cost $0.095/GB-month)
|
4. Tear down scanning infra: `terraform apply -var="scanning=false"` (deletes RDS and EC2, including the EBS icon volume)
|
||||||
5. Delete S3 `everytab-icons` bucket
|
|
||||||
6. Terminate EC2 instance
|
|
||||||
|
|
||||||
## DNS Architecture
|
## DNS Architecture
|
||||||
|
|
||||||
|
|
@ -376,18 +361,19 @@ Subsequent scrolls: one additional `/tabs/{n}.json` per "page" of tabs.
|
||||||
|
|
||||||
### Tab Rendering
|
### Tab Rendering
|
||||||
|
|
||||||
- Rows of tabs fill the viewport, styled to mimic Firefox browser tabs (v1)
|
- Rows of tabs fill the viewport, styled to match the visitor's browser (Chrome, Firefox, Safari — detected via `navigator.userAgent`)
|
||||||
- Each row has a subtle horizontal marquee animation (CSS `@keyframes` / `animation`) at slightly varying speeds
|
- Each row has a bidirectional marquee animation at varying speeds (90-150s per cycle), with random stagger to avoid synchronization
|
||||||
- Tab density adapts to viewport width (responsive)
|
- Tabs duplicated in DOM for seamless marquee loop (`translateX(-50%)`)
|
||||||
- Each tab shows: favicon (rendered via `<img src="data:image/png;base64,...">`) + truncated title
|
- Each tab shows: favicon (rendered via `<img src="data:image/png;base64,...">`) + truncated title
|
||||||
- No-icon tabs: just title text, no icon (Firefox behavior)
|
- No-icon tabs: just title text, no icon
|
||||||
- Enough tabs rendered to fill viewport + buffer below fold (so user can scroll immediately without waiting for next fetch)
|
- Light mode default, auto-switches to dark mode via `prefers-color-scheme`
|
||||||
|
- Hover shows full title as native tooltip
|
||||||
|
|
||||||
### Interaction
|
### Interaction
|
||||||
|
|
||||||
- **Click tab (iframe_ok=true):** Opens an iframe overlay showing the actual site
|
- **Click tab (iframe_ok=true):** Opens an inline iframe viewer between tab rows (75vh height, pushes content down)
|
||||||
- **Click tab (iframe_ok=false):** Opens site in a new tab (with subtle external-link indicator on the tab)
|
- **Click tab (iframe_ok=false):** Opens site in a new tab (with `↗` external-link indicator on the tab)
|
||||||
- **Close overlay:** X button or click outside dismisses iframe
|
- **Close viewer:** X button or Escape key. Only one viewer open at a time.
|
||||||
- **Scroll down:** When approaching the bottom, fetch next random bundle and render more rows
|
- **Scroll down:** When approaching the bottom, fetch next random bundle and render more rows
|
||||||
|
|
||||||
### Randomization
|
### Randomization
|
||||||
|
|
@ -397,11 +383,11 @@ Subsequent scrolls: one additional `/tabs/{n}.json` per "page" of tabs.
|
||||||
- Generate random bundle indices in range `[0, TOTAL_BUNDLES)`
|
- Generate random bundle indices in range `[0, TOTAL_BUNDLES)`
|
||||||
- Track fetched bundle IDs in a `Set` to avoid loading duplicates on continued scroll
|
- Track fetched bundle IDs in a `Set` to avoid loading duplicates on continued scroll
|
||||||
|
|
||||||
### Future Enhancements (v2+)
|
### Future Enhancements
|
||||||
- Browser-specific tab styles (Chrome tabs for Chrome users, Safari for Safari, etc.)
|
|
||||||
- Mobile-optimized layout
|
- Mobile-optimized layout
|
||||||
- "Search for a site" feature
|
- "Search for a site" feature
|
||||||
- Stats page (how many sites, coverage, etc.)
|
- Stats page (how many sites, coverage, etc.)
|
||||||
|
- Performance: IntersectionObserver to pause off-screen marquee rows
|
||||||
|
|
||||||
## Statistics & Metadata
|
## Statistics & Metadata
|
||||||
|
|
||||||
|
|
@ -486,14 +472,13 @@ This is served publicly at `/stats.json` on the live site — interesting metada
|
||||||
|
|
||||||
| Item | Estimate |
|
| Item | Estimate |
|
||||||
|------|----------|
|
|------|----------|
|
||||||
| EC2 c5.xlarge (~24-48hrs) | $8-16 |
|
| EC2 c5.xlarge (~3-4 days) | $12-16 |
|
||||||
| RDS db.t3.medium (~48-72hrs including dev time) | $3-7 |
|
| EBS 1TB gp3 (~4 days) | $10 |
|
||||||
| S3 everytab-icons storage (~500GB, prorated to days) | $1-3 |
|
| RDS db.t3.medium (~4 days) | $4-6 |
|
||||||
| S3 PUT requests (icon uploads, ~30M) | $15 |
|
|
||||||
| Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) |
|
| Common Crawl S3 reads (CC-Index + WARCs) | $0 (Open Data) |
|
||||||
| Data transfer (icon downloads from internet, inbound) | $0 (inbound free) |
|
| Data transfer (icon downloads from internet, inbound) | $0 (inbound free) |
|
||||||
| Data transfer (backup to homelab, outbound) | $5-10 |
|
| Data transfer (backup to homelab, outbound) | $5-45 (depends on icon archive size) |
|
||||||
| **Total** | **~$32-51** |
|
| **Total** | **~$31-77** |
|
||||||
|
|
||||||
### Hosting Phase (Monthly Steady-State)
|
### Hosting Phase (Monthly Steady-State)
|
||||||
|
|
||||||
|
|
@ -537,11 +522,11 @@ If the site gets significant traffic beyond CloudFront free tier, costs scale wi
|
||||||
2. **Inline icons in bundles** — One fetch gives you 100+ tabs to render. No per-icon requests.
|
2. **Inline icons in bundles** — One fetch gives you 100+ tabs to render. No per-icon requests.
|
||||||
3. **Base64 + Brotli** — Base64 for browser-native decoding (`atob()`). Brotli compression at the CDN layer reduces transfer size by ~25-30% for free.
|
3. **Base64 + Brotli** — Base64 for browser-native decoding (`atob()`). Brotli compression at the CDN layer reduces transfer size by ~25-30% for free.
|
||||||
4. **Unbound as system resolver** — Transparent to application code. Standard Go HTTP. No custom networking.
|
4. **Unbound as system resolver** — Transparent to application code. Standard Go HTTP. No custom networking.
|
||||||
5. **SHA-256 content-addressed icon storage** — Natural dedup at S3 layer. Same favicon stored once even if referenced by multiple hosts.
|
5. **SHA-256 content-addressed icon storage** — Natural dedup on local disk. Same favicon stored once even if referenced by multiple hosts.
|
||||||
6. **Permissive download, selective bundling** — Download ALL favicon formats during scanning. Convert to optimized PNG only during bundle generation. Decouples "capture as much as possible" from "serve the best version."
|
6. **Permissive download, selective bundling** — Download ALL favicon formats and sizes during scanning. Convert to optimized PNG only during bundle generation. Decouples "capture as much as possible" from "serve the best version."
|
||||||
7. **Partial index for work claiming** — Indexes only unscanned rows. Shrinks as work progresses. Minimal write amplification.
|
7. **Partial index for work claiming** — Indexes only unscanned rows. Shrinks as work progresses. Minimal write amplification.
|
||||||
8. **Two S3 buckets** — Clean separation of concerns. Private working storage vs public site. Safe deletion of temporary data.
|
8. **Local disk for icons, S3 for site** — Icons stored on EBS during scanning (avoids ~$175 in S3 PUT costs at 30M scale). Only the static site lives in S3 behind CloudFront.
|
||||||
9. **Per-millisecond random seed** — Every visitor sees a unique arrangement. No shared state, no server needed for randomization.
|
9. **Per-millisecond random seed** — Every visitor sees a unique arrangement. No shared state, no server needed for randomization.
|
||||||
10. **Viewport-sized bundles** — ~100-150 tabs per bundle, tuned to fill a screen. Faster loads, smaller memory footprint than 1MB bundles.
|
10. **Viewport-sized bundles** — ~100-150 tabs per bundle, tuned to fill a screen. Faster loads, smaller memory footprint than 1MB bundles.
|
||||||
11. **Include no-icon hosts** — Any host with a title is included. Firefox-style rendering (title only) for hosts without favicons.
|
11. **Include no-icon hosts** — Any host with a title is included. Firefox-style rendering (title only) for hosts without favicons.
|
||||||
12. **Denormalized best_icon_s3_key in hosts** — Avoids joins during bundle generation. Written once during icon selection, read once during bundling.
|
12. **Denormalized best_icon_s3_key in hosts** — Stores the SHA-256 hash of the chosen icon. Avoids joins during bundle generation. Written once during icon selection, read once during bundling.
|
||||||
|
|
|
||||||