update bundle gen to use channels and goroutines to saturate disk and not block on db access + bundle coalescing and uploading
parent 902928235c
commit 081866f62e
2 changed files with 137 additions and 117 deletions
@@ -297,17 +297,21 @@ Uses `DISTINCT ON (host_id)` for efficient single-pass selection. See `pipeline/
**Input:** All hosts where `html_title IS NOT NULL` (include hosts without icons)

**Process:**
1. Stream hosts from RDS in pages (keyset pagination on `random_order` column for shuffled output)
2. For each page, concurrently convert icons (configurable concurrency, default 200):
   - Read icon from local disk at `{icons_dir}/ab/cd/ef/{hash}`
   - Decode the image via Go's `image.Decode` (handles PNG, GIF, JPEG, WebP, ICO via registered decoders)
   - SVGs are excluded (no rasterizer); these hosts appear without icons
**Architecture:** Four-stage pipeline with all stages running concurrently:
```
[DB fetcher] → hostCh → [N converters] → entryCh → [bundle assembler] → uploadCh → [M uploaders]
```
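A minimal Go sketch of how the four stages in the diagram could be wired together with channels. The `Host` and `BundleEntry` fields, the `runPipeline` helper, the channel buffer sizes, and the in-memory slice standing in for the database are assumptions for illustration. The converter and uploader bodies are stubs, and flushing a final partial bundle is assumed rather than stated for the new design.

```go
package main

import (
	"fmt"
	"sync"
)

// Host and BundleEntry are simplified, hypothetical stand-ins for the real types.
type Host struct {
	ID    int
	Title string
}

type BundleEntry struct {
	HostID int
	Title  string
	Icon   string // base64 PNG, or "" when the host has no usable icon
}

// runPipeline wires the four stages together and returns how many bundles
// reached the upload stage. Buffer sizes are illustrative guesses.
func runPipeline(hosts []Host, converters, uploaders, bundleSize int) int {
	hostCh := make(chan Host, 1024)
	entryCh := make(chan BundleEntry, 1024)
	uploadCh := make(chan []BundleEntry, 16)

	// Stage 1: DB fetcher. A slice stands in for paged database queries.
	go func() {
		defer close(hostCh)
		for _, h := range hosts {
			hostCh <- h
		}
	}()

	// Stage 2: N converter workers. Icon conversion is stubbed out here.
	var convWG sync.WaitGroup
	for i := 0; i < converters; i++ {
		convWG.Add(1)
		go func() {
			defer convWG.Done()
			for h := range hostCh {
				entryCh <- BundleEntry{HostID: h.ID, Title: h.Title, Icon: ""}
			}
		}()
	}
	// Close entryCh only after every converter has finished.
	go func() {
		convWG.Wait()
		close(entryCh)
	}()

	// Stage 3: bundle assembler. Groups entries into fixed-size bundles.
	go func() {
		defer close(uploadCh)
		buf := make([]BundleEntry, 0, bundleSize)
		for e := range entryCh {
			buf = append(buf, e)
			if len(buf) == bundleSize {
				uploadCh <- buf
				buf = make([]BundleEntry, 0, bundleSize)
			}
		}
		if len(buf) > 0 {
			uploadCh <- buf // assumed: flush the final partial bundle
		}
	}()

	// Stage 4: M upload workers. The upload itself is stubbed as a counter.
	var upWG sync.WaitGroup
	var mu sync.Mutex
	uploaded := 0
	for i := 0; i < uploaders; i++ {
		upWG.Add(1)
		go func() {
			defer upWG.Done()
			for range uploadCh {
				mu.Lock()
				uploaded++
				mu.Unlock()
			}
		}()
	}
	upWG.Wait()
	return uploaded
}

func main() {
	hosts := make([]Host, 250)
	for i := range hosts {
		hosts[i] = Host{ID: i, Title: fmt.Sprintf("host-%d", i)}
	}
	fmt.Println(runPipeline(hosts, 20, 10, 120)) // prints 3 (120 + 120 + 10 entries)
}
```

Each stage closes its output channel once its input is drained, so shutdown propagates left to right without a separate done signal. With buffered channels a stage only stalls when the buffer ahead of it is full.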
1. **DB fetcher** (1 goroutine): continuously fetches pages of hosts via keyset pagination on `random_order`. Feeds hosts into `hostCh`. Never waits for downstream stages.
2. **Converter workers** (N goroutines, default 20): read hosts from `hostCh`, read icon from disk, decode, re-encode as PNG, base64-encode, emit `BundleEntry` to `entryCh`. CPU-bound; default tuned to ~5x core count on c5.xlarge (4 vCPUs).
   - Decode via Go's `image.Decode` (handles PNG, GIF, JPEG, WebP, BMP, ICO via registered decoders)
   - SVGs excluded (no rasterizer); these hosts appear without icons
   - Icons >128px downscaled to 32x32 (nearest-neighbor). Icons ≤128px kept as-is.
   - Re-encode as PNG, base64-encode
3. Converted entries accumulate in a buffer. Every 120 entries (configurable), serialize as JSON and upload to S3
4. Hosts without icons: included with `"icon": ""`
5. Final partial bundle written at end
3. **Bundle assembler** (1 goroutine): collects entries from `entryCh`. Every 120 entries (configurable), serializes them as JSON and sends the bundle to `uploadCh`. Hosts without icons are included with `"icon": ""`.
4. **Upload workers** (M goroutines, default 10): write bundles to S3 (or local disk in dry-run mode). I/O-bound; keeping multiple uploads in flight hides S3 PUT latency (~50-100ms each).
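The per-icon work done by a converter worker (decode, optional downscale, PNG re-encode, base64) might look like the sketch below. The names `convertIcon`, `downscaleNearest`, `decodedSize`, and `makePNG` are hypothetical, and treating ">128px" as "either dimension exceeds 128" is an assumption. Only the standard-library PNG, GIF, and JPEG decoders are registered here; WebP, BMP, and ICO would need additional decoder packages.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"image"
	"image/color"
	_ "image/gif"  // registers the GIF decoder with image.Decode
	_ "image/jpeg" // registers the JPEG decoder with image.Decode
	"image/png"    // PNG encoder; importing it also registers the PNG decoder
)

const (
	maxKeep = 128 // icons at or below this size are kept as-is
	target  = 32  // side length used when downscaling
)

// downscaleNearest resamples src to w x h using nearest-neighbor sampling.
func downscaleNearest(src image.Image, w, h int) *image.NRGBA {
	b := src.Bounds()
	dst := image.NewNRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		sy := b.Min.Y + y*b.Dy()/h
		for x := 0; x < w; x++ {
			sx := b.Min.X + x*b.Dx()/w
			dst.Set(x, y, src.At(sx, sy))
		}
	}
	return dst
}

// convertIcon decodes raw icon bytes and returns a base64-encoded PNG.
func convertIcon(raw []byte) (string, error) {
	img, _, err := image.Decode(bytes.NewReader(raw))
	if err != nil {
		return "", err
	}
	b := img.Bounds()
	// Assumption: an icon counts as ">128px" if either side exceeds 128.
	if b.Dx() > maxKeep || b.Dy() > maxKeep {
		img = downscaleNearest(img, target, target)
	}
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// decodedSize decodes a base64 PNG and reports its dimensions (demo helper).
func decodedSize(b64 string) (int, int, error) {
	raw, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return 0, 0, err
	}
	cfg, err := png.DecodeConfig(bytes.NewReader(raw))
	if err != nil {
		return 0, 0, err
	}
	return cfg.Width, cfg.Height, nil
}

// makePNG builds a small in-memory PNG of the given size (demo helper).
func makePNG(w, h int) []byte {
	img := image.NewNRGBA(image.Rect(0, 0, w, h))
	img.Set(0, 0, color.NRGBA{R: 255, A: 255})
	var buf bytes.Buffer
	_ = png.Encode(&buf, img)
	return buf.Bytes()
}

func main() {
	for _, side := range []int{64, 256} {
		b64, err := convertIcon(makePNG(side, side))
		if err != nil {
			fmt.Println("convert failed:", err)
			continue
		}
		w, h, _ := decodedSize(b64)
		fmt.Printf("%dpx input -> %dx%d output\n", side, w, h)
	}
}
```

A caller in the converter stage would presumably map a decode error (for example an SVG, which has no registered decoder) to an entry with `"icon": ""` rather than dropping the host.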
Bundles are written in-place (overwriting previous run). No delete-first step, so the live site always has valid data even if bundle gen crashes midway. The frontend's `TOTAL_BUNDLES` constant ensures only valid bundle indices are requested.

**Output:**
- `tabs/0000.json` through `tabs/{M}.json` in S3 `everytab-site`
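As a small illustration of the key layout, the sketch below derives object keys and a bundle count. The helpers `bundleKey` and `bundleCount` are hypothetical. The four-digit zero padding is inferred from the `tabs/0000.json` example, and the count assumes a final partial bundle is written.

```go
package main

import "fmt"

// bundleKey returns the object key for a bundle index, zero-padded to
// four digits (an assumption based on the "tabs/0000.json" example).
func bundleKey(index int) string {
	return fmt.Sprintf("tabs/%04d.json", index)
}

// bundleCount returns how many bundles are needed for the given number of
// entries, assuming the last bundle may be partially filled.
func bundleCount(entries, bundleSize int) int {
	return (entries + bundleSize - 1) / bundleSize
}

func main() {
	fmt.Println(bundleKey(0))          // tabs/0000.json
	fmt.Println(bundleKey(41))         // tabs/0041.json
	fmt.Println(bundleCount(250, 120)) // 3
}
```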