fixed diagram and last tweaks before we plan and code
parent
01b5de040c
commit
a327fb3db3
1 changed file with 50 additions and 36 deletions
|
|
@ -13,28 +13,33 @@ The scanning phase runs monthly (triggered by new Common Crawl releases), produc
|
||||||
|
|
||||||
```mermaid
|
```mermaid
|
||||||
flowchart TD
|
flowchart TD
|
||||||
subgraph "Scanning Phase (EC2 instance)"
|
subgraph EC2["Scanning Phase (EC2 instance)"]
|
||||||
A[Stage 1: Query CC-Index via DuckDB] --> B[Stage 2: Parse WARCs - Go]
|
A["Stage 1: Query CC-Index via DuckDB"]
|
||||||
B --> C[Stage 3: Download Icons - Go]
|
B["Stage 2: Parse WARCs - Go"]
|
||||||
C --> D[Stage 4: Select Best Icons]
|
C["Stage 3: Download Icons - Go"]
|
||||||
D --> E[Stage 5: Generate Bundles - Go]
|
D["Stage 4: Select Best Icons"]
|
||||||
E --> F[Stage 6: Build Frontend]
|
E["Stage 5: Generate Bundles - Go"]
|
||||||
|
F["Stage 6: Build Frontend"]
|
||||||
|
UB["Unbound - Local recursive resolver"]
|
||||||
|
|
||||||
|
A --> B --> C --> D --> E --> F
|
||||||
|
UB -.-> C
|
||||||
end
|
end
|
||||||
|
|
||||||
subgraph "External Data"
|
subgraph ExtData["External Data"]
|
||||||
CC[Common Crawl S3\nParquet Index + WARCs]
|
CC["Common Crawl S3 - Parquet Index + WARCs"]
|
||||||
end
|
end
|
||||||
|
|
||||||
subgraph "AWS Services"
|
subgraph AWS["AWS Services"]
|
||||||
RDS[(RDS Postgres\nhosts + icons tables)]
|
RDS[("RDS Postgres - hosts + icons tables")]
|
||||||
S3I[S3: everytab-icons\nRaw downloaded favicons]
|
S3I["S3: everytab-icons - Raw downloaded favicons"]
|
||||||
S3S[S3: everytab-site\ntabs/*.json + index.html]
|
S3S["S3: everytab-site - tabs/*.json + index.html"]
|
||||||
CF[CloudFront CDN]
|
CF["CloudFront CDN"]
|
||||||
end
|
end
|
||||||
|
|
||||||
subgraph "Post-Scan"
|
subgraph Post["Post-Scan"]
|
||||||
BAK[Backup to Homelab\nRDS dump + icons sync]
|
BAK["Backup to Homelab - RDS dump + icons sync"]
|
||||||
TEAR[Teardown\nDelete RDS, icons bucket, EC2]
|
TEAR["Teardown - Delete RDS, icons bucket, EC2"]
|
||||||
end
|
end
|
||||||
|
|
||||||
CC --> A
|
CC --> A
|
||||||
|
|
@ -51,11 +56,6 @@ flowchart TD
|
||||||
|
|
||||||
F --> BAK
|
F --> BAK
|
||||||
BAK --> TEAR
|
BAK --> TEAR
|
||||||
|
|
||||||
subgraph "DNS"
|
|
||||||
UB[Unbound\nLocal recursive resolver\non EC2]
|
|
||||||
end
|
|
||||||
UB -.-> C
|
|
||||||
```
|
```
|
||||||
|
|
||||||
**Key point:** DuckDB, Go programs, and Unbound all run on the same EC2 instance. The pipeline is sequential — one stage completes before the next begins.
|
**Key point:** DuckDB, Go programs, and Unbound all run on the same EC2 instance. The pipeline is sequential — one stage completes before the next begins.
|
||||||
|
|
@ -114,8 +114,8 @@ All resources in **us-east-1**.
|
||||||
| rel_type | TEXT | MIME type from HTML attribute (if specified) |
|
| rel_type | TEXT | MIME type from HTML attribute (if specified) |
|
||||||
| rel_sizes | TEXT | Sizes attribute from HTML (if specified) |
|
| rel_sizes | TEXT | Sizes attribute from HTML (if specified) |
|
||||||
| content_type | TEXT | Actual MIME type after download |
|
| content_type | TEXT | Actual MIME type after download |
|
||||||
| width | INT | Decoded pixel width |
|
| width | INT | Best usable pixel width (for ICO: largest standard size ≤64; for SVG: NULL) |
|
||||||
| height | INT | Decoded pixel height |
|
| height | INT | Best usable pixel height (for ICO: largest standard size ≤64; for SVG: NULL) |
|
||||||
| file_size | INT | Size in bytes |
|
| file_size | INT | Size in bytes |
|
||||||
| s3_key | TEXT | Key in everytab-icons bucket (SHA-256 of content) |
|
| s3_key | TEXT | Key in everytab-icons bucket (SHA-256 of content) |
|
||||||
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
|
| scan_state | TEXT DEFAULT 'unscanned' | `unscanned`, `in_progress`, `completed`, `failed` |
|
||||||
|
|
@ -152,7 +152,7 @@ All resources in **us-east-1**.
|
||||||
|
|
||||||
Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with `"icon": ""`) as long as they have a title. CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data.
|
Icons are stored inline as base64-encoded PNG. Hosts without favicons are included (with `"icon": ""`) as long as they have a title. CloudFront serves bundles with Brotli compression, which significantly reduces transfer size of base64 data.
|
||||||
|
|
||||||
Bundle size targets ~100-150 entries (enough to fill a viewport with buffer for scrolling). Estimated ~150-300KB per bundle uncompressed, smaller after Brotli.
|
Bundle size is parameterized (`ENTRIES_PER_BUNDLE`). Target: enough entries to fill a viewport plus a scroll buffer. Initial estimate: ~100-150 entries (~150-300KB uncompressed, smaller after Brotli). This will be tuned empirically once the frontend is built and we can measure how many tabs fill a screen.
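
A minimal sketch of how a parameterized bundle size could drive the chunking, assuming the host list is already in memory. The `Entry` field names other than `icon`, the placeholder value of 100, and the zero-based `{n}` numbering are assumptions for illustration, not part of the spec.

```go
package main

import "fmt"

// entriesPerBundle stands in for ENTRIES_PER_BUNDLE. 100 is a placeholder
// inside the initial 100-150 estimate, not a tuned value.
const entriesPerBundle = 100

// Entry is a hypothetical bundle record; Host and Title are assumed names.
type Entry struct {
	Host  string `json:"host"`
	Title string `json:"title"`
	Icon  string `json:"icon"` // base64 PNG, or "" when the host has no icon
}

// chunk splits entries into consecutive groups of at most size items.
// The last group may be shorter. A non-positive size yields no groups.
func chunk(entries []Entry, size int) [][]Entry {
	if size <= 0 {
		return nil
	}
	var out [][]Entry
	for start := 0; start < len(entries); start += size {
		end := start + size
		if end > len(entries) {
			end = len(entries)
		}
		out = append(out, entries[start:end])
	}
	return out
}

func main() {
	entries := make([]Entry, 250)
	for n, b := range chunk(entries, entriesPerBundle) {
		// Zero-based file numbering is an assumption.
		fmt.Printf("tabs/%d.json: %d entries\n", n, len(b))
	}
}
```

Keeping the size as a single constant (or flag) means the empirical tuning later only touches one value.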
|
||||||
|
|
||||||
## Pipeline Stages
|
## Pipeline Stages
|
||||||
|
|
||||||
|
|
@ -223,14 +223,28 @@ WHERE url_path = '/'
|
||||||
**Input:** `icons` table rows where `scan_state = 'unscanned'`
|
**Input:** `icons` table rows where `scan_state = 'unscanned'`
|
||||||
|
|
||||||
**Process:**
|
**Process:**
|
||||||
1. Claim batch: `UPDATE icons SET scan_state = 'in_progress' WHERE scan_state = 'unscanned' AND id IN (SELECT id FROM icons WHERE scan_state = 'unscanned' LIMIT N FOR UPDATE SKIP LOCKED) RETURNING *`
|
1. Claim batch (randomized to spread load across hosts):
|
||||||
|
```sql
|
||||||
|
UPDATE icons SET scan_state = 'in_progress'
|
||||||
|
WHERE id IN (
|
||||||
|
SELECT id FROM icons
|
||||||
|
WHERE scan_state = 'unscanned'
|
||||||
|
ORDER BY md5(id::text) -- deterministic shuffle: spreads hosts apart
|
||||||
|
LIMIT N
|
||||||
|
FOR UPDATE SKIP LOCKED
|
||||||
|
) RETURNING *;
|
||||||
|
```
|
||||||
|
This avoids back-to-back requests to the same domain: with 30M+ icons spread across many hosts, a shuffled batch of 1000 is unlikely to contain two icons from the same server.
|
||||||
2. For each icon URL:
|
2. For each icon URL:
|
||||||
- Make HTTP(S) GET request (standard Go HTTP client — DNS transparently goes through Unbound)
|
- Make HTTP(S) GET request (standard Go HTTP client — DNS transparently goes through Unbound)
|
||||||
- Enforce timeouts: 5s connect, 10s total
|
- Enforce timeouts: 5s connect, 10s total
|
||||||
- Enforce max download size: 512KB (generous for icons, but prevents abuse)
|
- Enforce max download size: 512KB (generous for icons, but prevents abuse)
|
||||||
- On success:
|
- On success:
|
||||||
- Validate magic bytes (is this actually an image?)
|
- Validate magic bytes (is this actually an image?)
|
||||||
- Decode to get dimensions (width, height) — just read headers, don't fully decode
|
- Decode to get dimensions:
|
||||||
|
- PNG/GIF/WebP/JPEG/BMP: read image headers for width/height
|
||||||
|
- ICO: parse ICO header, find largest embedded size ≤64x64 at a standard dimension (16/32/48/64), store THAT size in width/height
|
||||||
|
- SVG: store width=NULL, height=NULL (vector, no pixel size)
|
||||||
- Compute SHA-256 of content
|
- Compute SHA-256 of content
|
||||||
- Upload to S3 `everytab-icons/{sha256}` (skip if key already exists — dedup)
|
- Upload to S3 `everytab-icons/{sha256}` (skip if key already exists — dedup)
|
||||||
- Update icons row: s3_key, content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
|
- Update icons row: s3_key, content_type (from actual data, not HTTP header), width, height, file_size, scan_state = 'completed'
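
The ICO rule in the dimension step above can be sketched as a scan over the ICO directory. This is a minimal illustration, not the pipeline's actual code: `bestICOSize` and `makeICODir` are hypothetical names, and skipping non-square entries is an assumption the spec does not state.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// bestICOSize scans an ICO directory and returns the largest square entry
// whose side is 16, 32, 48, or 64 pixels. ok is false when no entry
// qualifies; how the pipeline handles that case is not specified here.
func bestICOSize(data []byte) (size int, ok bool, err error) {
	if len(data) < 6 {
		return 0, false, errors.New("ico: header truncated")
	}
	// ICONDIR: reserved (0), type (1 = icon), image count; all little-endian.
	if binary.LittleEndian.Uint16(data[0:2]) != 0 ||
		binary.LittleEndian.Uint16(data[2:4]) != 1 {
		return 0, false, errors.New("ico: bad magic")
	}
	count := int(binary.LittleEndian.Uint16(data[4:6]))
	if len(data) < 6+16*count {
		return 0, false, errors.New("ico: directory truncated")
	}
	for i := 0; i < count; i++ {
		entry := data[6+16*i : 6+16*(i+1)]
		// Width and height are single bytes; 0 encodes 256.
		w, h := int(entry[0]), int(entry[1])
		if w != h {
			continue // assumption: ignore non-square entries
		}
		switch w {
		case 16, 32, 48, 64:
			if w > size {
				size, ok = w, true
			}
		}
	}
	return size, ok, nil
}

// makeICODir builds a directory-only ICO (no image data) for demonstration.
func makeICODir(sides ...byte) []byte {
	buf := make([]byte, 6+16*len(sides))
	binary.LittleEndian.PutUint16(buf[2:4], 1)
	binary.LittleEndian.PutUint16(buf[4:6], uint16(len(sides)))
	for i, s := range sides {
		buf[6+16*i] = s
		buf[6+16*i+1] = s
	}
	return buf
}

func main() {
	// Entries of 16, 48, 256 (encoded as 0), and 128 pixels.
	size, ok, err := bestICOSize(makeICODir(16, 48, 0, 128))
	fmt.Println(size, ok, err) // prints "48 true <nil>"
}
```

Only the directory is read, so this fits the "read headers, don't fully decode" approach used for the other formats.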
|
||||||
|
|
@ -303,15 +317,15 @@ UPDATE hosts h SET best_icon_s3_key = (
|
||||||
2. Randomize the full result set
|
2. Randomize the full result set
|
||||||
3. For each host with an icon (best_icon_s3_key IS NOT NULL):
|
3. For each host with an icon (best_icon_s3_key IS NOT NULL):
|
||||||
- Download from S3 `everytab-icons/{s3_key}`
|
- Download from S3 `everytab-icons/{s3_key}`
|
||||||
- Decode the image (handle ICO, PNG, GIF, WebP, SVG):
|
- Decode the image based on format:
|
||||||
- ICO: extract the largest embedded image at a standard size <= 64x64, decode to raster
|
- ICO: parse container, extract the image at the size recorded in width/height (the largest standard size ≤64x64). ICO can embed BMP or PNG internally — decode whichever is present.
|
||||||
- SVG: rasterize to 32x32 PNG
|
- PNG: decode directly
|
||||||
- WebP/GIF/BMP: decode to raster
|
- GIF/WebP/BMP/JPEG: decode to raster
|
||||||
- PNG: use as-is (re-compress if possible)
|
- SVG: rasterize to 32x32 (use a Go SVG rasterizer library)
|
||||||
- Re-encode as optimized PNG (preserve original dimensions, don't upscale)
|
- Re-encode as optimized PNG at original dimensions (never upscale — a 16x16 stays 16x16)
|
||||||
- Base64-encode the PNG bytes
|
- Base64-encode the PNG bytes
|
||||||
4. For hosts without icons: set icon to empty string
|
4. For hosts without icons: set icon to empty string
|
||||||
5. Chunk into groups of N entries (~100-150, tuned to fill a viewport)
|
5. Chunk into groups of `ENTRIES_PER_BUNDLE` entries (parameterized, initially ~100-150, tuned to viewport fill)
|
||||||
6. Serialize each chunk as JSON, write to S3 `everytab-site/tabs/{n}.json`
|
6. Serialize each chunk as JSON, write to S3 `everytab-site/tabs/{n}.json`
|
||||||
7. Record total bundle count
|
7. Record total bundle count
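
The re-encode and base64 steps above can be sketched with the Go standard library alone. This is an illustrative sketch under assumptions: `encodeIcon`, `iconDims`, and `solidIcon` are hypothetical helper names, and `png.BestCompression` is only a stand-in for whatever "optimized PNG" ends up meaning in the real pipeline.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"image"
	"image/color"
	"image/png"
)

// encodeIcon re-encodes an already-decoded image as PNG at its original
// dimensions (no resizing) and returns the bytes as standard base64.
func encodeIcon(img image.Image) (string, error) {
	var buf bytes.Buffer
	enc := png.Encoder{CompressionLevel: png.BestCompression}
	if err := enc.Encode(&buf, img); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// iconDims decodes the base64 string and reads only the PNG header.
func iconDims(b64 string) (int, int, error) {
	raw, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return 0, 0, err
	}
	cfg, err := png.DecodeConfig(bytes.NewReader(raw))
	if err != nil {
		return 0, 0, err
	}
	return cfg.Width, cfg.Height, nil
}

// solidIcon builds a side x side red square to stand in for a decoded favicon.
func solidIcon(side int) image.Image {
	img := image.NewRGBA(image.Rect(0, 0, side, side))
	for y := 0; y < side; y++ {
		for x := 0; x < side; x++ {
			img.Set(x, y, color.RGBA{R: 255, A: 255})
		}
	}
	return img
}

func main() {
	b64, err := encodeIcon(solidIcon(16))
	if err != nil {
		panic(err)
	}
	w, h, err := iconDims(b64)
	if err != nil {
		panic(err)
	}
	fmt.Printf("icon is %dx%d, %d base64 chars\n", w, h, len(b64))
}
```

Because the encoder never resizes, a 16x16 input round-trips as 16x16, matching the "never upscale" rule.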
|
||||||
|
|
||||||
|
|
@ -333,11 +347,11 @@ UPDATE hosts h SET best_icon_s3_key = (
|
||||||
|
|
||||||
### Stage 7: Backup & Teardown
|
### Stage 7: Backup & Teardown
|
||||||
|
|
||||||
**Process (manual, with confirmation):**
|
**Process (manual, with confirmation at each step):**
|
||||||
1. Dump RDS database: `pg_dump` → transfer to homelab
|
1. Dump RDS database: `pg_dump` → transfer to homelab
|
||||||
2. Sync icons: `aws s3 sync s3://everytab-icons/ homelab:/path/to/backup/icons/`
|
2. Sync icons: `aws s3 sync s3://everytab-icons/ homelab:/path/to/backup/icons/`
|
||||||
3. **Confirm backups are complete and verified**
|
3. **Verify backups:** confirm pg_dump restores cleanly on homelab, spot-check icon files
|
||||||
4. Delete RDS instance (with final snapshot as safety net)
|
4. Delete RDS instance (skip final snapshot — homelab backup is the source of truth, snapshots cost $0.095/GB-month)
|
||||||
5. Delete S3 `everytab-icons` bucket
|
5. Delete S3 `everytab-icons` bucket
|
||||||
6. Terminate EC2 instance
|
6. Terminate EC2 instance
|
||||||
|
|
||||||
|
|
|
||||||