From c50be97fd77a604e65143c6d2f504f0fe407dc86 Mon Sep 17 00:00:00 2001
From: Joe Lothan
Date: Sun, 17 May 2026 14:00:14 -0400
Subject: [PATCH] added PLAN.md with initial dev plan

---
 PLAN.md | 687 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 687 insertions(+)
 create mode 100644 PLAN.md

diff --git a/PLAN.md b/PLAN.md
new file mode 100644
index 0000000..64bc82c
--- /dev/null
+++ b/PLAN.md
@@ -0,0 +1,687 @@
+# EveryTab Implementation Plan
+
+This plan builds the system described in ARCHITECTURE.md in incremental steps. We start with 100K hosts to validate the pipeline end-to-end, then scale to the full ~30M.
+
+Each step has a clear deliverable and validation criteria. Steps within a phase are sequential; some phases can overlap (noted where applicable).
+
+---
+
+## Phase 0: Project Setup & AWS Infrastructure
+
+### Step 0.1: Repository Structure
+
+Create the project layout:
+
+```
+everytab/
+├── design.md
+├── ARCHITECTURE.md
+├── PLAN.md
+├── infra/                   # AWS CLI scripts for setup/teardown
+│   ├── setup.sh             # Create RDS, S3 buckets, security groups
+│   ├── teardown.sh          # Delete non-permanent resources
+│   └── ec2-userdata.sh      # EC2 bootstrap (install Go, DuckDB, Unbound)
+├── pipeline/
+│   ├── 01_cc_index/         # DuckDB query scripts
+│   ├── 02_warc_parse/       # Go program
+│   ├── 03_icon_download/    # Go program
+│   ├── 04_best_icon/        # SQL script
+│   ├── 05_bundle_gen/       # Go program
+│   └── 06_frontend/         # Build script, templates
+├── frontend/
+│   ├── index.html
+│   └── site.js
+├── stats/                   # Stats output from each stage (gitignored)
+└── go.mod                   # Shared Go module for pipeline programs
+```
+
+**Done when:** Repo structure exists, `go.mod` initialized, `.gitignore` covers stats/ and any local config.
+
+### Step 0.2: AWS Infrastructure (Manual CLI)
+
+Create resources using AWS CLI commands in `infra/setup.sh`:
+
+1. **S3 buckets:**
+   - `everytab-icons` (private, no public access)
+   - `everytab-site` (private, accessed via CloudFront OAC)
+
+2. **RDS Postgres:**
+   - `db.t3.medium`, 20GB storage (expandable), Postgres 16
+   - In a VPC, security group allows inbound 5432 from EC2 security group
+   - No public access (EC2 connects within VPC)
+   - No multi-AZ (dev, not production)
+   - Set a strong password, store in a local `.env` (gitignored)
+
+3. **EC2 instance:**
+   - `c5.xlarge` (4 vCPU, 8GB RAM), enough for Go concurrency + Unbound cache
+   - Amazon Linux 2023 or Ubuntu 24.04
+   - Security group: allow SSH (from your IP), allow outbound all
+   - Same VPC/subnet as RDS
+   - Key pair for SSH access
+
+4. **CloudFront distribution:**
+   - Origin: `everytab-site` S3 bucket (OAC)
+   - Default cache behavior: cache everything, Brotli+Gzip compression
+   - Can set up now or defer to Phase 2
+
+5. **IAM role for EC2:**
+   - S3 read/write to both buckets
+   - Attach as instance profile
+
+**Validation:** SSH into EC2, confirm `psql` can connect to RDS, confirm `aws s3 ls` shows both buckets.
+
+**Done when:** All resources exist, EC2 can reach RDS and S3.
+
+### Step 0.3: EC2 Environment Setup
+
+Bootstrap script (`infra/ec2-userdata.sh` or run manually):
+
+1. Install Go (latest stable, 1.22+)
+2. Install DuckDB CLI
+3. Install Unbound, configure as recursive resolver:
+   - `/etc/unbound/unbound.conf`: recursive mode, no forwarding, listen on 127.0.0.1
+   - High cache: `msg-cache-size: 512m`, `rrset-cache-size: 1g`
+   - `cache-min-ttl: 3600`
+   - `prefetch: yes`
+   - `num-threads: 4`
+4. Set `/etc/resolv.conf` → `nameserver 127.0.0.1`
+5. Install `psql` client, `pg_dump`
+6. Confirm DuckDB httpfs extension works: `INSTALL httpfs; LOAD httpfs;`
+
+**Validation:**
+- `go version` works
+- `duckdb -c "INSTALL httpfs; LOAD httpfs; SELECT 1;"` works
+- `dig example.com @127.0.0.1` resolves (Unbound working)
+- `psql $DATABASE_URL -c "SELECT 1;"` connects to RDS
+
+**Done when:** EC2 is a working development environment for all pipeline stages.
+
+---
+
+## Phase 1: CC-Index Query (Stage 1)
+
+### Step 1.1: Database Schema
+
+Create the Postgres tables. Run via `psql`:
+
+```sql
+CREATE TABLE hosts (
+    id SERIAL PRIMARY KEY,
+    hostname TEXT NOT NULL UNIQUE,
+    protocol TEXT NOT NULL,
+    crawl_id TEXT NOT NULL,
+    warc_filename TEXT NOT NULL,
+    warc_record_offset BIGINT NOT NULL,
+    warc_record_length INT NOT NULL,
+    html_title TEXT,
+    iframe_allowed BOOLEAN,
+    best_icon_s3_key TEXT,
+    parsed BOOLEAN DEFAULT FALSE
+);
+
+CREATE TABLE icons (
+    id SERIAL PRIMARY KEY,
+    host_id INT NOT NULL REFERENCES hosts(id),
+    url TEXT NOT NULL,
+    source TEXT NOT NULL,
+    rel_type TEXT,
+    rel_sizes TEXT,
+    content_type TEXT,
+    width INT,
+    height INT,
+    file_size INT,
+    s3_key TEXT,
+    scan_state TEXT DEFAULT 'unscanned',
+    error TEXT
+);
+
+CREATE INDEX idx_hosts_parsed ON hosts(id) WHERE parsed = FALSE;
+CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned';
+CREATE INDEX idx_icons_host_id ON icons(host_id);
+```
+
+**Done when:** Tables exist in RDS, schema matches ARCHITECTURE.md.
+
+### Step 1.2: DuckDB CC-Index Query (100K limit)
+
+Write `pipeline/01_cc_index/query.sql` (or a shell script wrapping DuckDB CLI).
+
+The script:
+1. Connects DuckDB to RDS via the postgres extension
+2. Queries the CC-Index parquet files via httpfs (latest crawl)
+3. Filters per ARCHITECTURE.md criteria
+4. Deduplicates per hostname (prefer https)
+5. Limits to 100,000 rows for dev
+6. Inserts directly into the hosts table
+
+Key considerations:
+- Find the latest crawl index path. The columnar (parquet) index is published under `s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2026-05/subset=warc/*.parquet`; the `cc-index/collections/CC-MAIN-2026-05/indexes/` prefix holds the gzipped CDX index, not parquet. Verify the actual path for the chosen crawl.
+- DuckDB postgres extension: `INSTALL postgres; LOAD postgres; ATTACH 'dbname=... host=... ...' AS pg (TYPE POSTGRES);`
+- The dedup logic: partition by hostname, order by protocol (https first), take first row
+- Add `LIMIT 100000` for dev, remove for full run
+- Time the query: if httpfs takes >1hr, switch to downloading parquet first
+
+**Validation:**
+- `SELECT COUNT(*) FROM hosts;` returns ~100,000
+- `SELECT protocol, COUNT(*) FROM hosts GROUP BY protocol;` shows mostly https
+- `SELECT * FROM hosts LIMIT 5;` shows sane data (real hostnames, valid WARC paths)
+- Spot-check: pick a few hostnames, verify they're real websites
+
+**Stats to emit:** `stats/01_cc_index.json` with total_domains, https_count, http_count, query_time_seconds.
+
+**Done when:** 100K hosts in the database with valid WARC coordinates.
+
+### Step 1.3: Validate WARC Coordinates
+
+Quick sanity check: before writing the full WARC parser, confirm we can actually fetch WARC records. Each record in a Common Crawl `.warc.gz` file is its own gzip member, so the byte-range response must be decompressed before it is readable:
+
+```bash
+# Pick a random row and read its WARC coordinates into shell variables
+read -r WARC_FILENAME OFFSET LENGTH < <(psql "$DATABASE_URL" -At -F ' ' -c \
+  "SELECT warc_filename, warc_record_offset, warc_record_length FROM hosts ORDER BY random() LIMIT 1;")
+
+# Fetch it with a curl byte-range request, then decompress the gzip member
+curl -s -r "$OFFSET-$((OFFSET + LENGTH - 1))" "https://data.commoncrawl.org/$WARC_FILENAME" | gunzip | head -c 500
+```
+
+The output should show a WARC record header followed by HTTP response headers and HTML.
+
+**Done when:** We can manually fetch 3-5 WARC records and see valid HTML content.
+
+---
+
+## Phase 2: WARC Parsing (Stage 2)
+
+### Step 2.1: Go Project Setup
+
+Set up the shared Go module and the WARC parser binary:
+
+```
+pipeline/02_warc_parse/
+├── main.go      # Entry point, CLI flags, orchestration
+├── warc.go      # WARC record fetching (S3 byte-range)
+├── parser.go    # HTML parsing (title, link rel=icon, iframe headers)
+└── db.go        # Postgres batch read/write
+```
+
+Dependencies:
+- `github.com/jackc/pgx/v5`: Postgres driver (pool, batch operations)
+- `golang.org/x/net/html`: Lenient HTML parser
+- Standard library `net/http` for S3 byte-range requests
+
+CLI flags:
+- `--db` connection string
+- `--batch-size` (default 500)
+- `--concurrency` (default 1000)
+- `--dry-run` (print parsed results, don't write to DB)
+- `--limit` (process at most N rows, for testing)
+
+**Done when:** Project compiles, connects to DB, can read a batch of hosts rows.
+
+### Step 2.2: WARC Fetch + Parse Logic
+
+Implement:
+1. Byte-range fetch from `https://data.commoncrawl.org/{warc_filename}`
+2. Decompress the fetched gzip member, then parse the WARC record envelope (find the HTTP response within)
+3. Extract HTTP response headers:
+   - `X-Frame-Options` → if present and not `ALLOWALL`, iframe_allowed = false
+   - `Content-Security-Policy` → check for `frame-ancestors` directive
+4. Parse HTML body:
+   - Extract `<title>` content (first title tag, truncate at 512 chars)
+   - Extract all `<link rel="icon">` and `<link rel="shortcut icon">`:
+     - href (resolve relative URLs against `{protocol}://{hostname}/`)
+     - type attribute (if present)
+     - sizes attribute (if present)
+   - Ignore data: URIs, ignore links to other domains' icons for now
+
+**Dry-run test:** Run with `--limit 100 --dry-run` and inspect output. Check:
+- Titles look reasonable (not empty, not garbage)
+- Icon URLs are well-formed (absolute, correct protocol)
+- iframe_allowed is set correctly (spot-check against real sites)
+
+**Done when:** Can parse 100 WARC records correctly with `--dry-run` showing reasonable output.
+
+### Step 2.3: Batch DB Writes + Full 100K Run
+
+Implement the database write path:
+1. For each parsed host: UPDATE hosts SET html_title, iframe_allowed, parsed = TRUE
+2. For each host: INSERT INTO icons (host_id, url, source='favicon_ico') for `/favicon.ico`
+3. For each discovered link rel=icon: INSERT INTO icons (host_id, url, source='link_rel', rel_type, rel_sizes)
+4. Use batch/bulk operations (pgx CopyFrom or batch INSERT)
+
+Run against the full 100K hosts:
+- Monitor throughput (hosts/sec)
+- Watch for errors (log to stderr)
+
+**Validation:**
+- `SELECT COUNT(*) FROM hosts WHERE parsed = TRUE;` should approach 100,000
+- `SELECT COUNT(*) FROM icons;` should be > 100,000 (at minimum one /favicon.ico per host)
+- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL;` (expect 90%+)
+- `SELECT source, COUNT(*) FROM icons GROUP BY source;` (see the split)
+- `SELECT COUNT(*) FROM hosts WHERE iframe_allowed = FALSE;` (expect 30-50%)
+- Spot-check: pick some hosts, verify title matches the actual site
+
+**Stats:** `stats/02_warc_parse.json`
+
+**Done when:** All 100K hosts parsed, icons table populated, stats look reasonable.
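To make the write path in Step 2.3 concrete, the sketch below builds the rows for one host in the shape a bulk insert expects: one `favicon_ico` row plus one `link_rel` row per discovered link. The names `iconLink` and `iconRows` are hypothetical, and mapping the link's `type` attribute to `rel_type` is an assumption about the schema. Only the row construction is shown; it uses the standard library so it runs without a database.

```go
package main

import "fmt"

// iconLink is a hypothetical holder for one parsed <link rel="icon"> element.
type iconLink struct {
	URL     string // absolute URL, already resolved
	RelType string // assumption: the link's type attribute, "" if absent
	Sizes   string // the link's sizes attribute, "" if absent
}

// nullable maps an empty string to nil so it is stored as SQL NULL.
func nullable(s string) any {
	if s == "" {
		return nil
	}
	return s
}

// iconRows returns rows in the column order
// (host_id, url, source, rel_type, rel_sizes).
func iconRows(hostID int, protocol, hostname string, links []iconLink) [][]any {
	rows := [][]any{
		// Every host gets a /favicon.ico candidate.
		{hostID, protocol + "://" + hostname + "/favicon.ico", "favicon_ico", nil, nil},
	}
	for _, l := range links {
		rows = append(rows, []any{hostID, l.URL, "link_rel", nullable(l.RelType), nullable(l.Sizes)})
	}
	return rows
}

func main() {
	links := []iconLink{
		{URL: "https://example.com/icon-32.png", RelType: "image/png", Sizes: "32x32"},
		{URL: "https://example.com/icon.svg"},
	}
	for _, row := range iconRows(42, "https", "example.com", links) {
		fmt.Println(row...)
	}
}
```

With pgx v5, rows in this shape can be handed to `CopyFrom` via `pgx.CopyFromRows(rows)` together with `pgx.Identifier{"icons"}` and the matching column list, which is one way to satisfy item 4.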
+
+---
+
+## Phase 3: Icon Download (Stage 3)
+
+### Step 3.1: Icon Downloader Go Program
+
+```
+pipeline/03_icon_download/
+├── main.go          # Entry point, CLI flags, worker pool
+├── downloader.go    # HTTP fetch with timeouts, size limits
+├── decoder.go       # Image validation + dimension extraction
+├── s3.go            # Upload to everytab-icons bucket
+└── db.go            # Claim work, update results
+```
+
+CLI flags:
+- `--db` connection string
+- `--s3-bucket` (default `everytab-icons`)
+- `--concurrency` (default 1000, tunable)
+- `--batch-size` (default 500)
+- `--timeout` (default 10s)
+- `--max-size` (default 512KB)
+- `--dry-run` (fetch and validate but don't upload to S3 or update DB)
+- `--limit` (process at most N icons)
+
+Dependencies:
+- `github.com/jackc/pgx/v5`: Postgres
+- `github.com/aws/aws-sdk-go-v2`: S3 uploads
+- Standard library `image` + sub-packages (PNG, GIF, JPEG) for decoding dimensions, plus `golang.org/x/image` for BMP and WebP, which the standard library does not decode
+- An ICO decoder: no specific library is chosen yet. Find a proper ICO decoder, or write a simple one that reads the ICO header for directory entries
+
+### Step 3.2: Work Claiming + Download Logic
+
+Implement:
+1. Claim batch with randomized order (md5 shuffle, FOR UPDATE SKIP LOCKED)
+2. For each icon URL:
+   - HTTP GET with timeouts (5s dial, 10s total)
+   - Read up to max-size bytes, abort if exceeded
+   - Validate magic bytes (PNG: `\x89PNG`, GIF: `GIF8`, ICO: `\x00\x00\x01\x00`, etc.)
+   - Determine actual content type from magic bytes (don't trust HTTP Content-Type)
+   - Decode dimensions:
+     - PNG/GIF/JPEG/WebP/BMP: read image header (Go `image.DecodeConfig`, with the matching decoder packages imported)
+     - ICO: parse directory entries, find largest at standard size ≤64x64
+     - SVG: set width=NULL, height=NULL
+   - Compute SHA-256 of full content
+   - Check if S3 key exists (HEAD request); if yes, skip upload (dedup)
+   - Upload to S3 if new
+3. Update icons row with results (or error)
+
+**Dry-run test:** `--limit 200 --dry-run` prints what it would do for 200 icons. Check URLs, detected types, dimensions.
+
+**Done when:** Can download, validate, and upload icons for a small batch.
+
+### Step 3.3: Full 100K Icon Run
+
+Run against all icons in the database (likely 150K-300K icon rows for 100K hosts).
+
+Monitor:
+- icons/sec throughput
+- Error breakdown (DNS failures, timeouts, HTTP errors, invalid images)
+- S3 dedup hit rate
+- Memory usage (adjust concurrency if needed)
+
+**Validation:**
+- `SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;` (expect mostly completed, some failed)
+- `SELECT error, COUNT(*) FROM icons WHERE scan_state = 'failed' GROUP BY error ORDER BY count DESC LIMIT 20;` (understand failure modes)
+- `aws s3 ls s3://everytab-icons/ | wc -l` (confirm icons in S3)
+- Spot-check: download a few icons from S3, open them, verify they're valid images
+
+**Stats:** `stats/03_icon_download.json`
+
+**Done when:** Icon download complete for 100K dev set, error rate understood, S3 populated.
+
+---
+
+## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5)
+
+### Step 4.1: Best Icon Selection SQL
+
+Write `pipeline/04_best_icon/select.sql`:
+
+```sql
+UPDATE hosts h SET best_icon_s3_key = sub.s3_key
+FROM (
+    SELECT DISTINCT ON (i.host_id) i.host_id, i.s3_key
+    FROM icons i
+    WHERE i.scan_state = 'completed'
+    ORDER BY i.host_id,
+        CASE
+            WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
+            WHEN i.width = i.height AND i.width <= 64 THEN 1
+            WHEN i.width IS NOT NULL AND i.width <= 64 AND i.height <= 64 THEN 2
+            ELSE 3
+        END,
+        COALESCE(i.width, 0) DESC,
+        CASE
+            WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
+            WHEN i.content_type = 'image/webp' THEN 1
+            WHEN i.content_type = 'image/svg+xml' THEN 2
+            ELSE 3
+        END,
+        i.file_size ASC
+) sub
+WHERE h.id = sub.host_id;
+```
+
+**Validation:**
+- `SELECT COUNT(*) FROM hosts WHERE best_icon_s3_key IS NOT NULL;` (expect 60-80% of hosts)
+- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL AND best_icon_s3_key IS NULL;` (hosts with title but no icon; these will still be in bundles)
+- Spot-check: for a few hosts, verify the selected icon is reasonable (correct size, valid)
+
+**Stats:** `stats/04_best_icon.json`
+
+**Done when:** best_icon_s3_key populated for hosts that have valid icons.
+
+### Step 4.2: Bundle Generator Go Program
+
+```
+pipeline/05_bundle_gen/
+├── main.go       # Entry point, CLI flags
+├── db.go         # Query hosts + icon keys
+├── convert.go    # Icon format conversion → PNG
+├── bundle.go     # Chunk + serialize JSON
+└── s3.go         # Upload bundles to everytab-site
+```
+
+CLI flags:
+- `--db` connection string
+- `--icons-bucket` (default `everytab-icons`)
+- `--site-bucket` (default `everytab-site`)
+- `--entries-per-bundle` (tunable, start at 120)
+- `--dry-run` (generate bundles to local disk, don't upload)
+- `--limit` (only process N hosts, for testing)
+
+### Step 4.3: Icon Conversion Logic
+
+Implement format conversion to PNG:
+1. Download icon from S3 by key
+2. Detect format from magic bytes
+3. Decode:
+   - PNG: decode directly
+   - ICO: parse container, extract image at recorded width/height, decode BMP or PNG within
+   - GIF/JPEG/BMP/WebP: decode to RGBA
+   - SVG: rasterize to 32x32 (use a Go SVG library, or shell out to `rsvg-convert` if simpler)
+4. Re-encode as PNG (optimized, don't upscale)
+5. Base64-encode
+
+**Test:** Convert 50 icons of mixed formats manually, verify output PNGs look correct.
+
+### Step 4.4: Bundle Assembly + Upload
+
+Implement:
+1. Query all hosts WHERE html_title IS NOT NULL, randomize (ORDER BY random())
+2. For each host: fetch + convert its icon (or set empty string if no icon)
+3. Assemble entries into chunks of `ENTRIES_PER_BUNDLE`
+4. Serialize each chunk as JSON (`tabs/{n}.json`)
+5. Upload to S3 `everytab-site/tabs/`
+6. Record total bundle count
+
+**Dry-run:** Generate bundles to local disk, inspect a few:
+- Valid JSON
+- Icons render in browser (paste a data:image/png;base64,... URI)
+- Entries have host, title, icon, icon_w, icon_h, iframe_ok
+
+**Validation:**
+- Bundle files exist in S3
+- `aws s3 ls s3://everytab-site/tabs/ | wc -l` matches expected count
+- Random bundle can be fetched and parsed as JSON
+- Total hosts across all bundles = count of hosts with titles
+
+**Stats:** `stats/05_bundle_gen.json`
+
+**Done when:** All bundles uploaded to S3, JSON is valid, icons render.
+
+---
+
+## Phase 5: Frontend (Stage 6)
+
+This phase can begin in parallel with Phase 3-4 using mock bundle data.
+
+### Step 5.1: Mock Data for Frontend Dev
+
+Generate 2-3 small mock bundle files (`tabs/0.json`, `tabs/1.json`, `tabs/2.json`) with ~20 entries each. Use real favicons (Google, GitHub, Wikipedia, etc.) manually base64-encoded. This lets us develop the frontend without waiting for the pipeline.
+
+Serve locally with any static file server (`python -m http.server`).
+
+**Done when:** Mock bundles exist and can be served locally.
+
+### Step 5.2: Basic Tab Rendering
+
+Build `frontend/index.html` and `frontend/site.js`:
+
+1. HTML: minimal shell with a container div, inline CSS for tab styling
+2. JS: fetch a bundle, render tabs as rows filling the viewport
+3. Tab appearance: mimic Firefox tab shape (rounded top corners, slight border)
+4. Each tab shows favicon (16x16 or 32x32 img from data URI) + truncated title
+5. No-icon tabs show title only
+
+Focus: get the visual density right. How many tabs fit across? How many rows fill the viewport? This determines `ENTRIES_PER_BUNDLE`.
+
+**Done when:** Page renders tabs from a mock bundle. Visually looks like a page full of browser tabs.
+
+### Step 5.3: Marquee Animation
+
+Add horizontal marquee to each row:
+- CSS `@keyframes` animation, translateX
+- Each row at slightly different speed and direction (some left, some right)
+- Smooth, subtle movement: not distracting, just enough to feel alive
+- Rows need extra tabs beyond viewport width to avoid gaps during scroll
+
+**Done when:** Rows scroll smoothly, no visual glitches at edges.
+
+### Step 5.4: Interaction: Click, Iframe, Close
+
+Implement tab click behavior:
+1. If `iframe_ok`: show an overlay with iframe loading the site (`{protocol}://{hostname}`)
+2. If `!iframe_ok`: open in new tab (`target="_blank"`, add rel="noopener")
+3. Visual indicator on tabs that will open externally (small icon/badge)
+4. Close overlay: X button + click-outside + Escape key
+
+**Done when:** Clicking tabs works correctly for both iframe and external cases.
+
+### Step 5.5: Infinite Scroll + Random Bundle Loading
+
+Implement:
+1. Seeded PRNG using `Date.now()` as the seed, which generates a deterministic sequence of bundle indices
+2. On page load: fetch first bundle, render
+3. Scroll detection: when user approaches bottom, fetch next random bundle
+4. Track loaded bundle IDs in a Set (no duplicates)
+5. Append new rows below existing ones
+6. Handle edge case: all bundles loaded (unlikely with 50K+ bundles but handle gracefully)
+
+`TOTAL_BUNDLES` is a constant baked into the JS at build time.
+
+**Done when:** Infinite scroll works, new bundles load seamlessly, no duplicate bundles.
+
+### Step 5.6: Frontend Build Script
+
+Write `pipeline/06_frontend/build.sh`:
+1. Read total bundle count (from pipeline output or S3)
+2. Inject `const TOTAL_BUNDLES = {M};` into site.js
+3. Copy index.html + site.js to S3 `everytab-site/`
+4. Invalidate CloudFront (if distribution exists)
+
+**Done when:** Build script produces deployable frontend with correct bundle count.
+
+---
+
+## Phase 6: Integration & End-to-End Test (100K)
+
+### Step 6.1: Run Full Pipeline (100K)
+
+Execute all stages in sequence on EC2:
+1. Verify hosts table has 100K entries (from Phase 1)
+2. Run WARC parser (Phase 2); should complete in minutes
+3. Run icon downloader (Phase 3); should complete in 10-30 minutes at 100K scale
+4. Run best icon selection (Phase 4.1)
+5. Run bundle generator (Phase 4.2-4.4)
+6. Run frontend build (Phase 5.6)
+
+**Validation:** Visit the CloudFront URL. The site should work:
+- Tabs render with real favicons and titles
+- Clicking works (iframe + external)
+- Scrolling loads more tabs
+- No JS console errors
+
+### Step 6.2: Tune Parameters
+
+Based on the 100K run:
+- **ENTRIES_PER_BUNDLE:** Look at the live site. Does one bundle fill the screen? Too many tabs? Too few? Adjust.
+- **Concurrency:** Was the icon download memory-stable? CPU-bound or network-bound? Adjust goroutine pool size.
+- **Timeouts:** What was the error distribution? Are timeouts too aggressive? Too lenient?
+- **Icon selection:** Do the selected icons look good? Any weird sizes or broken images?
+
+Update CLI flag defaults based on findings.
+
+### Step 6.3: Collect & Review Stats
+
+Merge all `stats/*.json` into a single pipeline report. Review:
+- Loss at each stage (domains → parsed → icons downloaded → icons selected → bundled)
+- Time per stage
+- Error patterns (are certain TLDs failing more? certain icon formats?)
+- Storage usage (S3 icons bucket, S3 site bucket)
+
+Identify any pipeline bugs or data quality issues. Fix before scaling up.
+
+**Done when:** End-to-end works at 100K, parameters tuned, stats reviewed, bugs fixed.
+
+---
+
+## Phase 7: Full-Scale Run (30M)
+
+### Step 7.1: Remove Limits, Re-run CC-Index Query
+
+Update the DuckDB query to remove `LIMIT 100000`. Re-run.
+
+Considerations:
+- If httpfs takes >1hr, switch to downloading the parquet files first
+- May need to increase RDS storage (30M rows with WARC paths ≈ 5-10GB)
+- Monitor DuckDB memory usage
+
+**Validation:** `SELECT COUNT(*) FROM hosts;` shows ~30M rows.
+
+### Step 7.2: Run WARC Parser at Scale
+
+Run with full concurrency against 30M hosts. Expected time: 2-6 hours.
+
+Monitor:
+- Throughput (hosts/sec)
+- Error rate stability (should plateau, not climb)
+- Postgres connection pool health
+- Memory usage
+
+### Step 7.3: Run Icon Downloader at Scale
+
+This is the long pole: expected 12-48 hours.
+
+Monitor continuously:
+- icons/sec rate
+- DNS cache hit rate (check Unbound stats: `unbound-control stats`)
+- S3 upload rate
+- Error rate by type
+- Completion percentage
+
+If too slow (projected >48hrs):
+- Consider increasing concurrency (if memory allows)
+- Consider spinning up a fleet (add more EC2 instances running the same binary)
+- Check if DNS is the bottleneck (Unbound stats)
+- Check if S3 uploads are the bottleneck (batch or reduce HEAD checks)
+
+### Step 7.4: Best Icon Selection + Bundle Generation
+
+Run at full scale. Expected: 1-2 hours total.
+
+Monitor bundle sizes: verify they're in the expected range with `ENTRIES_PER_BUNDLE` from tuning.
+
+### Step 7.5: Rebuild Frontend + Deploy
+
+Run frontend build with the real bundle count. Invalidate CloudFront.
+
+**Validation:** Visit the live site. Browse around. Check:
+- Tab variety (seeing diverse sites, not just one TLD)
+- Icon quality (no broken images, reasonable sizes)
+- Performance (bundles load quickly, no jank)
+- Stats page / stats.json looks correct
+
+**Done when:** Full-scale site is live and working.
+
+---
+
+## Phase 8: Backup & Teardown
+
+### Step 8.1: Backup RDS to Homelab
+
+```bash
+# On EC2 (fast connection to RDS):
+pg_dump -Fc $DATABASE_URL > everytab_dump.pgfc
+
+# Transfer to homelab (from EC2 or direct):
+scp everytab_dump.pgfc homelab:/backups/everytab/
+
+# On homelab, verify restore:
+pg_restore -d everytab_local everytab_dump.pgfc
+psql everytab_local -c "SELECT COUNT(*) FROM hosts; SELECT COUNT(*) FROM icons;"
+```
+
+### Step 8.2: Backup Icons S3 to Homelab
+
+```bash
+# From homelab (or EC2 as intermediary):
+aws s3 sync s3://everytab-icons/ /backups/everytab/icons/
+
+# Verify file count matches:
+ls /backups/everytab/icons/ | wc -l
+# Compare with: aws s3 ls s3://everytab-icons/ | wc -l
+```
+
+### Step 8.3: Verify & Teardown
+
+After confirming backups:
+
+```bash
+# Verify the live site still works (it only depends on everytab-site + CloudFront)
+curl -s https://your-cloudfront-domain.net/ | head
+
+# Teardown scanning infrastructure:
+aws rds delete-db-instance --db-instance-identifier everytab --skip-final-snapshot
+aws s3 rb s3://everytab-icons --force
+aws ec2 terminate-instances --instance-ids i-xxxxx
+```
+
+**Done when:** Only `everytab-site` S3 bucket + CloudFront remain running. Monthly cost: ~$2-4.
+
+---
+
+## Development Notes
+
+### What Can Be Parallelized
+
+- **Frontend dev (Phase 5.1-5.5)** can happen at any time using mock data
+- **AWS infra setup (Phase 0.2)** can happen while writing code locally
+- **Icon downloader (Phase 3)** and **bundle generator (Phase 4)** are independent codebases, can be written in parallel
+
+### Testing Strategy
+
+- **Dry-run flags** on all Go programs: print what would happen without mutating DB/S3
+- **--limit flags** on all Go programs: process a small subset quickly
+- **Spot-checks:** after each stage, manually verify 5-10 random entries
+- **Stats files:** compare counts between stages to catch data loss
+- **100K dev set:** full pipeline at small scale before committing to a 24hr+ full run
+
+### Common Pitfalls to Watch For
+
+- **DuckDB CC-Index path:** The exact S3 path to parquet files changes per crawl. Check Common Crawl's website for the latest crawl ID and index location.
+- **WARC record format:** WARC records have a specific envelope format (WARC/1.0 header, blank line, HTTP response), and each record is gzip-compressed on its own. Don't assume the HTTP response starts at byte 0.
+- **Relative icon URLs:** `/favicon.ico` is relative to root, but `favicon.ico` (no leading slash) is relative to the page path. Since we only have root pages (`/`), both resolve the same. But `../icons/fav.png` could be tricky: handle gracefully or skip.
+- **ICO files are complex:** The ICO container format can embed BMP (with a modified header) or PNG. Many "ICO" files are actually just PNGs renamed to .ico. Check magic bytes, not file extension.
+- **SVG rasterization:** Go doesn't have great native SVG support. Consider shelling out to `rsvg-convert` or `librsvg`, or use a Go library like `github.com/nicholasgasior/goresvg`. This can be a follow-up if SVG icons are rare.
+- **Postgres connection limits:** RDS derives `max_connections` from instance memory, so a db.t3.medium allows far fewer than 1000 connections (check the real value with `SHOW max_connections;`). With 1000 goroutines, we need connection pooling (pgx pool handles this). Set pool max to ~40 connections.
+- **S3 dedup races:** S3 has provided strong read-after-write consistency since December 2020, so a HEAD request sees an object as soon as its upload completes. Two workers can still both run the HEAD check before either upload finishes and both get "not found". Handle that gracefully: just upload again, which is idempotent since the key is the content hash.
+- **CloudFront caching:** After deploying new bundles, invalidate `/*` or set short TTL during development. For production, use long TTLs (bundles are immutable between crawls).
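The ICO and content-hash pitfalls above can be illustrated with a short standard-library sketch. It is not the planned `decoder.go`: `sniffFormat`, `parseICODir`, and `contentKey` are hypothetical names, the set of signatures is only the common ones, and using the bare hex SHA-256 digest as the S3 key is an assumption about the key layout. The ICO layout used here is the standard one: a 6-byte header (reserved, type, count) followed by 16-byte directory entries, where a width or height byte of 0 means 256.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// sniffFormat guesses the content type from magic bytes, ignoring both the
// file extension and the HTTP Content-Type. It returns "" when unknown.
func sniffFormat(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")):
		return "image/png"
	case bytes.HasPrefix(b, []byte("GIF8")):
		return "image/gif"
	case bytes.HasPrefix(b, []byte{0x00, 0x00, 0x01, 0x00}):
		return "image/x-icon"
	case bytes.HasPrefix(b, []byte{0xFF, 0xD8, 0xFF}):
		return "image/jpeg"
	case len(b) >= 12 && string(b[0:4]) == "RIFF" && string(b[8:12]) == "WEBP":
		return "image/webp"
	case bytes.HasPrefix(b, []byte("BM")):
		return "image/bmp"
	}
	return ""
}

// icoEntry is one image listed in an ICO directory.
type icoEntry struct {
	Width, Height int
	Size, Offset  uint32
}

// parseICODir reads the ICO header and directory entries without decoding
// the embedded images.
func parseICODir(b []byte) ([]icoEntry, error) {
	if len(b) < 6 || sniffFormat(b) != "image/x-icon" {
		return nil, fmt.Errorf("not an ICO file")
	}
	n := int(binary.LittleEndian.Uint16(b[4:6]))
	if len(b) < 6+16*n {
		return nil, fmt.Errorf("truncated ICO directory: want %d entries", n)
	}
	entries := make([]icoEntry, 0, n)
	for i := 0; i < n; i++ {
		e := b[6+16*i : 6+16*(i+1)]
		w, h := int(e[0]), int(e[1])
		if w == 0 { // a zero byte encodes 256
			w = 256
		}
		if h == 0 {
			h = 256
		}
		entries = append(entries, icoEntry{
			Width:  w,
			Height: h,
			Size:   binary.LittleEndian.Uint32(e[8:12]),
			Offset: binary.LittleEndian.Uint32(e[12:16]),
		})
	}
	return entries, nil
}

// contentKey returns the hex SHA-256 of the content (assumed S3 key layout).
func contentKey(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	ico := []byte{
		0, 0, 1, 0, 1, 0, // header: reserved, type=1 (icon), count=1
		32, 32, 0, 0, 1, 0, 32, 0, // 32x32, planes=1, 32 bits per pixel
		16, 0, 0, 0, 22, 0, 0, 0, // 16 data bytes at offset 22
	}
	entries, err := parseICODir(ico)
	fmt.Println(sniffFormat(ico), entries, err)

	// A PNG served as favicon.ico is still detected as PNG.
	fmt.Println(sniffFormat([]byte("\x89PNG\r\n\x1a\n")), contentKey(ico))
}
```

Because the key depends only on the bytes, two workers that both miss the HEAD check write the same object under the same key, which is why the duplicate upload is harmless.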