# EveryTab Implementation Plan

This plan builds the system described in ARCHITECTURE.md in incremental steps. We start with 100K hosts to validate the pipeline end-to-end, then scale to the full ~30M.

Each step has a clear deliverable and validation criteria. Steps within a phase are sequential; some phases can overlap (noted where applicable).

---

## Phase 0: Project Setup & AWS Infrastructure

### Step 0.1: Repository Structure

Create the project layout:

```
everytab/
├── design.md
├── ARCHITECTURE.md
├── PLAN.md
├── infra/               # AWS CLI scripts for setup/teardown
│   ├── setup.sh         # Create RDS, S3 buckets, security groups
│   ├── teardown.sh      # Delete non-permanent resources
│   └── ec2-userdata.sh  # EC2 bootstrap (install Go, DuckDB, Unbound)
├── pipeline/
│   ├── 01_cc_index/     # DuckDB query scripts
│   ├── 02_warc_parse/   # Go program
│   ├── 03_icon_download/# Go program
│   ├── 04_best_icon/    # SQL script
│   ├── 05_bundle_gen/   # Go program
│   └── 06_frontend/     # Build script, templates
├── frontend/
│   ├── index.html
│   └── site.js
├── stats/               # Stats output from each stage (gitignored)
└── go.mod               # Shared Go module for pipeline programs
```

**Done when:** Repo structure exists, `go.mod` initialized, `.gitignore` covers stats/ and any local config.

### Step 0.2: AWS Infrastructure (Manual CLI)

Create resources using AWS CLI commands in `infra/setup.sh`:

1. **S3 buckets:**
   - `everytab-icons` (private, no public access)
   - `everytab-site` (private, accessed via CloudFront OAC)

2. **RDS Postgres:**
   - `db.t3.medium`, 20GB storage (expandable), Postgres 16
   - In a VPC, security group allows inbound 5432 from EC2 security group
   - No public access (EC2 connects within VPC)
   - No multi-AZ (dev, not production)
   - Set a strong password, store in a local `.env` (gitignored)

3. **EC2 instance:**
   - `c5.xlarge` (4 vCPU, 8GB RAM) — enough for Go concurrency + Unbound cache
   - Amazon Linux 2023 or Ubuntu 24.04
   - Security group: allow SSH (from your IP), allow outbound all
   - Same VPC as RDS (RDS itself needs a DB subnet group covering at least two Availability Zones, even without multi-AZ)
   - Key pair for SSH access

4. **CloudFront distribution:**
   - Origin: `everytab-site` S3 bucket (OAC)
   - Default cache behavior: cache everything, Brotli+Gzip compression
   - Can set up now or defer to Phase 2

5. **IAM role for EC2:**
   - S3 read/write to both buckets
   - Attach as instance profile

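**Validation:** SSH into EC2, confirm `psql` can connect to RDS, confirm `aws s3 ls s3://everytab-icons/` and `aws s3 ls s3://everytab-site/` both succeed (a bare `aws s3 ls` needs `s3:ListAllMyBuckets`, which a bucket-scoped role may not grant).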

**Done when:** All resources exist, EC2 can reach RDS and S3.

### Step 0.3: EC2 Environment Setup

Bootstrap script (`infra/ec2-userdata.sh` or run manually):

1. Install Go (latest stable, 1.22+)
2. Install DuckDB CLI
3. Install Unbound, configure as recursive resolver:
   - `/etc/unbound/unbound.conf`: recursive mode, no forwarding, listen on 127.0.0.1
   - High cache: `msg-cache-size: 512m`, `rrset-cache-size: 1g`
   - `cache-min-ttl: 3600`
   - `prefetch: yes`
   - `num-threads: 4`
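4. Set `/etc/resolv.conf` → `nameserver 127.0.0.1` (if systemd-resolved or the DHCP client manages this file, configure it there so the change survives reboots)
5. Install the Postgres 16 client tools (`psql`, `pg_dump`); `pg_dump` must be at least as new as the server it dumps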
6. Confirm DuckDB httpfs extension works: `INSTALL httpfs; LOAD httpfs;`

**Validation:**
- `go version` works
- `duckdb -c "INSTALL httpfs; LOAD httpfs; SELECT 1;"` works
- `dig example.com @127.0.0.1` resolves (Unbound working)
- `psql "$DATABASE_URL" -c "SELECT 1;"` connects to RDS

**Done when:** EC2 is a working development environment for all pipeline stages.

---

## Phase 1: CC-Index Query (Stage 1)

### Step 1.1: Database Schema

Create the Postgres tables. Run via `psql`:

```sql
CREATE TABLE hosts (
    id SERIAL PRIMARY KEY,
    hostname TEXT NOT NULL UNIQUE,
    protocol TEXT NOT NULL,
    crawl_id TEXT NOT NULL,
    warc_filename TEXT NOT NULL,
    warc_record_offset BIGINT NOT NULL,
    warc_record_length INT NOT NULL,
    html_title TEXT,
    iframe_allowed BOOLEAN,
    best_icon_s3_key TEXT,
    parsed BOOLEAN DEFAULT FALSE
);

CREATE TABLE icons (
    id SERIAL PRIMARY KEY,
    host_id INT NOT NULL REFERENCES hosts(id),
    url TEXT NOT NULL,
    source TEXT NOT NULL,
    rel_type TEXT,
    rel_sizes TEXT,
    content_type TEXT,
    width INT,
    height INT,
    file_size INT,
    s3_key TEXT,
    scan_state TEXT DEFAULT 'unscanned',
    error TEXT
);

CREATE INDEX idx_hosts_parsed ON hosts(id) WHERE parsed = FALSE;
CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned';
CREATE INDEX idx_icons_host_id ON icons(host_id);
```

**Done when:** Tables exist in RDS, schema matches ARCHITECTURE.md.

### Step 1.2: DuckDB CC-Index Query (100K limit)

Write `pipeline/01_cc_index/query.sql` (or a shell script wrapping DuckDB CLI).

The script:
1. Connects DuckDB to RDS via the postgres extension
2. Queries the CC-Index parquet files via httpfs (latest crawl)
3. Filters per ARCHITECTURE.md criteria
4. Deduplicates per hostname (prefer https)
5. Limits to 100,000 rows for dev
6. Inserts directly into the hosts table

Key considerations:
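- Find the latest crawl's columnar index path (e.g., `s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2026-05/subset=warc/*.parquet`; the `cc-index/collections/.../indexes/cdx-*.gz` files are the gzipped CDX index, not parquet). Verify the actual path and crawl ID before running.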
- DuckDB postgres extension: `INSTALL postgres; LOAD postgres; ATTACH 'dbname=... host=... ...' AS pg (TYPE POSTGRES);`
- The dedup logic: partition by hostname, order by protocol (https first), take first row
- Add `LIMIT 100000` for dev, remove for full run
- Time the query — if httpfs takes >1hr, switch to downloading parquet first

**Validation:**
- `SELECT COUNT(*) FROM hosts;` returns ~100,000
- `SELECT protocol, COUNT(*) FROM hosts GROUP BY protocol;` shows mostly https
- `SELECT * FROM hosts LIMIT 5;` shows sane data (real hostnames, valid WARC paths)
- Spot-check: pick a few hostnames, verify they're real websites

**Stats to emit:** `stats/01_cc_index.json` with total_domains, https_count, http_count, query_time_seconds.

**Done when:** 100K hosts in the database with valid WARC coordinates.

### Step 1.3: Validate WARC Coordinates

Quick sanity check — before writing the full WARC parser, confirm we can actually fetch WARC records:

```bash
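# Pick a random row (unaligned, tuples only, space-separated) into shell variables
read -r WARC_FILENAME OFFSET LENGTH < <(psql "$DATABASE_URL" -At -F ' ' -c \
  "SELECT warc_filename, warc_record_offset, warc_record_length FROM hosts ORDER BY random() LIMIT 1;")

# Fetch it with a curl byte-range request; each record is its own gzip member
curl -s -r "$OFFSET-$((OFFSET + LENGTH - 1))" "https://data.commoncrawl.org/$WARC_FILENAME" | gunzip | head -c 500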
```

Should see a WARC record header followed by HTTP response headers and HTML.

**Done when:** We can manually fetch 3-5 WARC records and see valid HTML content.

---

## Phase 2: WARC Parsing (Stage 2)

### Step 2.1: Go Project Setup

Set up the shared Go module and the WARC parser binary:

```
pipeline/02_warc_parse/
├── main.go          # Entry point, CLI flags, orchestration
├── warc.go          # WARC record fetching (S3 byte-range)
├── parser.go        # HTML parsing (title, link rel=icon, iframe headers)
└── db.go            # Postgres batch read/write
```

Dependencies:
- `github.com/jackc/pgx/v5` — Postgres driver (pool, batch operations)
- `golang.org/x/net/html` — Lenient HTML parser
- Standard library `net/http` for HTTP byte-range requests and `compress/gzip` for decompressing the fetched records

CLI flags:
- `--db` connection string
- `--batch-size` (default 500)
- `--concurrency` (default 1000)
- `--dry-run` (print parsed results, don't write to DB)
- `--limit` (process at most N rows, for testing)

**Done when:** Project compiles, connects to DB, can read a batch of hosts rows.

### Step 2.2: WARC Fetch + Parse Logic

Implement:
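1. Byte-range fetch from `https://data.commoncrawl.org/{warc_filename}` (`Range: bytes={offset}-{offset+length-1}`)
2. Gunzip the response (each record is its own gzip member), then parse the WARC record envelope to find the HTTP response within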
3. Extract HTTP response headers:
   - `X-Frame-Options` → if present and not `ALLOWALL`, iframe_allowed = false
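   - `Content-Security-Policy` → if a `frame-ancestors` directive is present and does not allow arbitrary origins (e.g., `'none'`, `'self'`, or a specific host list), iframe_allowed = false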
4. Parse HTML body:
   - Extract `<title>` content (first title tag, truncate at 512 chars)
   - Extract all `<link rel="icon">` and `<link rel="shortcut icon">`:
     - href (resolve relative URLs against `{protocol}://{hostname}/`)
     - type attribute (if present)
     - sizes attribute (if present)
   - Ignore data: URIs, ignore links to other domains' icons for now

**Dry-run test:** Run with `--limit 100 --dry-run` and inspect output. Check:
- Titles look reasonable (not empty, not garbage)
- Icon URLs are well-formed (absolute, correct protocol)
- iframe_allowed is set correctly (spot-check against real sites)

**Done when:** Can parse 100 WARC records correctly with `--dry-run` showing reasonable output.

### Step 2.3: Batch DB Writes + Full 100K Run

Implement the database write path:
1. For each parsed host: UPDATE hosts SET html_title, iframe_allowed, parsed = TRUE
2. For each host: INSERT INTO icons (host_id, url, source='favicon_ico') for `/favicon.ico`
3. For each discovered link rel=icon: INSERT INTO icons (host_id, url, source='link_rel', rel_type, rel_sizes)
4. Use batch/bulk operations (pgx CopyFrom or batch INSERT)

Run against the full 100K hosts:
- Monitor throughput (hosts/sec)
- Watch for errors (log to stderr)

**Validation:**
- `SELECT COUNT(*) FROM hosts WHERE parsed = TRUE;` should approach 100,000
- `SELECT COUNT(*) FROM icons;` should be > 100,000 (at minimum one /favicon.ico per host)
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL;` — expect 90%+
- `SELECT source, COUNT(*) FROM icons GROUP BY source;` — see the split
- `SELECT COUNT(*) FROM hosts WHERE iframe_allowed = FALSE;` — expect 30-50%
- Spot-check: pick some hosts, verify title matches the actual site

**Stats:** `stats/02_warc_parse.json`

**Done when:** All 100K hosts parsed, icons table populated, stats look reasonable.

---

## Phase 3: Icon Download (Stage 3)

### Step 3.1: Icon Downloader Go Program

```
pipeline/03_icon_download/
├── main.go          # Entry point, CLI flags, worker pool
├── downloader.go    # HTTP fetch with timeouts, size limits
├── decoder.go       # Image validation + dimension extraction
├── s3.go            # Upload to everytab-icons bucket
└── db.go            # Claim work, update results
```

CLI flags:
- `--db` connection string
- `--s3-bucket` (default `everytab-icons`)
- `--concurrency` (default 1000, tunable)
- `--batch-size` (default 500)
- `--timeout` (default 10s)
- `--max-size` (default 512KB)
- `--dry-run` (fetch and validate but don't upload to S3 or update DB)
- `--limit` (process at most N icons)

Dependencies:
- `github.com/jackc/pgx/v5` — Postgres
- `github.com/aws/aws-sdk-go-v2` — S3 uploads
- Standard library `image` + sub-packages for decoding PNG/GIF/JPEG dimensions; `golang.org/x/image/bmp` and `golang.org/x/image/webp` for BMP and WebP
- An ICO parser: find a maintained ICO decoder, or write a simple one that reads the ICO header and directory entries

### Step 3.2: Work Claiming + Download Logic

Implement:
1. Claim batch with randomized order (md5 shuffle, FOR UPDATE SKIP LOCKED)
2. For each icon URL:
   - HTTP GET with timeouts (5s dial, 10s total)
   - Read up to max-size bytes, abort if exceeded
   - Validate magic bytes (PNG: `\x89PNG`, GIF: `GIF8`, ICO: `\x00\x00\x01\x00`, etc.)
   - Determine actual content type from magic bytes (don't trust HTTP Content-Type)
   - Decode dimensions:
     - PNG/GIF/JPEG/WebP/BMP: read image header (Go `image.DecodeConfig`, with the `golang.org/x/image` decoders registered for WebP and BMP)
     - ICO: parse directory entries, find largest at standard size ≤64x64
     - SVG: set width=NULL, height=NULL
   - Compute SHA-256 of full content
   - Check if S3 key exists (HEAD request); if yes, skip upload (dedup)
   - Upload to S3 if new
3. Update icons row with results (or error)
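The magic-byte check and content hashing in step 2 could look roughly like this. It is a sketch, not a complete sniffer: the function names are hypothetical, the SVG test is a crude heuristic, and using the bare hex digest as the S3 key is an assumption (the plan only says the key is the content hash).

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
)

// sniffImageType returns a MIME type based on leading bytes only,
// or "" when the content does not look like a supported image.
func sniffImageType(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")):
		return "image/png"
	case bytes.HasPrefix(b, []byte("GIF87a")), bytes.HasPrefix(b, []byte("GIF89a")):
		return "image/gif"
	case bytes.HasPrefix(b, []byte("\xff\xd8\xff")):
		return "image/jpeg"
	case bytes.HasPrefix(b, []byte{0x00, 0x00, 0x01, 0x00}):
		return "image/x-icon"
	case len(b) >= 12 && string(b[:4]) == "RIFF" && string(b[8:12]) == "WEBP":
		return "image/webp"
	case bytes.HasPrefix(b, []byte("BM")):
		return "image/bmp"
	}
	// Heuristic: treat anything with an <svg tag near the start as SVG.
	head := b
	if len(head) > 1024 {
		head = head[:1024]
	}
	if bytes.Contains(bytes.ToLower(head), []byte("<svg")) {
		return "image/svg+xml"
	}
	return ""
}

// contentKey returns the lowercase hex SHA-256 of the icon bytes.
// Identical icons served by different hosts map to the same key.
func contentKey(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}
```

HTML error pages served at `/favicon.ico` fall through to `""`, which is one of the more common failure modes to expect in the error breakdown. The standard library's `http.DetectContentType` covers a similar set of signatures and may be worth comparing against.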

**Dry-run test:** `--limit 200 --dry-run` — prints what it would do for 200 icons. Check URLs, detected types, dimensions.

**Done when:** Can download, validate, and upload icons for a small batch.
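If no suitable ICO library turns up, the directory parsing mentioned in the download logic is small enough to write by hand. The sketch below reads only the 6-byte header and the 16-byte directory entries; the type and function names are hypothetical, and "standard size" is interpreted here as "square", which is an assumption.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// icoEntry describes one image inside an ICO container.
type icoEntry struct {
	Width, Height int    // pixels; a stored 0 means 256
	BitCount      int    // bits per pixel as declared in the directory
	Size, Offset  uint32 // byte length and file offset of the image data
}

// parseICODirectory reads the ICO header and its directory entries.
// It does not decode the embedded BMP or PNG images.
func parseICODirectory(b []byte) ([]icoEntry, error) {
	le := binary.LittleEndian
	if len(b) < 6 || le.Uint16(b[0:2]) != 0 || le.Uint16(b[2:4]) != 1 {
		return nil, errors.New("not an ICO file")
	}
	n := int(le.Uint16(b[4:6]))
	if len(b) < 6+16*n {
		return nil, errors.New("truncated ICO directory")
	}
	entries := make([]icoEntry, 0, n)
	for i := 0; i < n; i++ {
		e := b[6+16*i : 6+16*(i+1)]
		w, h := int(e[0]), int(e[1])
		if w == 0 {
			w = 256
		}
		if h == 0 {
			h = 256
		}
		entries = append(entries, icoEntry{
			Width:    w,
			Height:   h,
			BitCount: int(le.Uint16(e[6:8])),
			Size:     le.Uint32(e[8:12]),
			Offset:   le.Uint32(e[12:16]),
		})
	}
	return entries, nil
}

// pickICOEntry returns the largest square entry no bigger than maxSide,
// preferring higher bit depth on ties. ok is false if nothing qualifies.
func pickICOEntry(entries []icoEntry, maxSide int) (best icoEntry, ok bool) {
	for _, e := range entries {
		if e.Width != e.Height || e.Width > maxSide {
			continue
		}
		if !ok || e.Width > best.Width || (e.Width == best.Width && e.BitCount > best.BitCount) {
			best, ok = e, true
		}
	}
	return best, ok
}
```

Calling `pickICOEntry(entries, 64)` gives the width and height to record in the `icons` row; the same `Offset` and `Size` can later be used by the bundle generator to slice out the embedded image.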

### Step 3.3: Full 100K Icon Run

Run against all icons in the database (likely 150K-300K icon rows for 100K hosts).

Monitor:
- icons/sec throughput
- Error breakdown (DNS failures, timeouts, HTTP errors, invalid images)
- S3 dedup hit rate
- Memory usage (adjust concurrency if needed)

**Validation:**
- `SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;` — expect mostly completed, some failed
- `SELECT error, COUNT(*) FROM icons WHERE scan_state = 'failed' GROUP BY error ORDER BY count DESC LIMIT 20;` — understand failure modes
- `aws s3 ls s3://everytab-icons/ --recursive | wc -l` (confirm icons in S3; `--recursive` also counts keys under prefixes)
- Spot-check: download a few icons from S3, open them, verify they're valid images

**Stats:** `stats/03_icon_download.json`

**Done when:** Icon download complete for 100K dev set, error rate understood, S3 populated.

---

## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5)

### Step 4.1: Best Icon Selection SQL

Write `pipeline/04_best_icon/select.sql`:

```sql
UPDATE hosts h SET best_icon_s3_key = sub.s3_key
FROM (
  SELECT DISTINCT ON (i.host_id) i.host_id, i.s3_key
  FROM icons i
  WHERE i.scan_state = 'completed'
  ORDER BY i.host_id,
    CASE
      WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
      WHEN i.width = i.height AND i.width <= 64 THEN 1
      WHEN i.width IS NOT NULL AND i.width <= 64 AND i.height <= 64 THEN 2
      ELSE 3
    END,
    COALESCE(i.width, 0) DESC,
    CASE
      WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
      WHEN i.content_type = 'image/webp' THEN 1
      WHEN i.content_type = 'image/svg+xml' THEN 2
      ELSE 3
    END,
    i.file_size ASC
) sub
WHERE h.id = sub.host_id;
```

**Validation:**
- `SELECT COUNT(*) FROM hosts WHERE best_icon_s3_key IS NOT NULL;` — expect 60-80% of hosts
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL AND best_icon_s3_key IS NULL;` — hosts with title but no icon (will still be in bundles)
- Spot-check: for a few hosts, verify the selected icon is reasonable (correct size, valid)

**Stats:** `stats/04_best_icon.json`

**Done when:** best_icon_s3_key populated for hosts that have valid icons.

### Step 4.2: Bundle Generator Go Program

```
pipeline/05_bundle_gen/
├── main.go          # Entry point, CLI flags
├── db.go            # Query hosts + icon keys
├── convert.go       # Icon format conversion → PNG
├── bundle.go        # Chunk + serialize JSON
└── s3.go            # Upload bundles to everytab-site
```

CLI flags:
- `--db` connection string
- `--icons-bucket` (default `everytab-icons`)
- `--site-bucket` (default `everytab-site`)
- `--entries-per-bundle` (tunable, start at 120)
- `--dry-run` (generate bundles to local disk, don't upload)
- `--limit` (only process N hosts, for testing)

### Step 4.3: Icon Conversion Logic

Implement format conversion to PNG:
1. Download icon from S3 by key
2. Detect format from magic bytes
3. Decode:
   - PNG: decode directly
   - ICO: parse container, extract image at recorded width/height, decode BMP or PNG within
   - GIF/JPEG/BMP/WebP: decode to RGBA
   - SVG: rasterize to 32x32 (use a Go SVG library, or shell out to `rsvg-convert` if simpler)
4. Re-encode as PNG (optimized, don't upscale)
5. Base64-encode

**Test:** Convert 50 icons of mixed formats manually, verify output PNGs look correct.

### Step 4.4: Bundle Assembly + Upload

Implement:
1. Query all hosts WHERE html_title IS NOT NULL, randomize (ORDER BY random())
2. For each host: fetch + convert its icon (or set empty string if no icon)
3. Assemble entries into chunks of `ENTRIES_PER_BUNDLE`
4. Serialize each chunk as JSON (`tabs/{n}.json`)
5. Upload to S3 `everytab-site/tabs/`
6. Record total bundle count

**Dry-run:** Generate bundles to local disk, inspect a few:
- Valid JSON
- Icons render in browser (paste a data:image/png;base64,... URI)
- Entries have host, title, icon, icon_w, icon_h, iframe_ok

**Validation:**
- Bundle files exist in S3
- `aws s3 ls s3://everytab-site/tabs/ | wc -l` matches expected count
- Random bundle can be fetched and parsed as JSON
- Total hosts across all bundles = count of hosts with titles

**Stats:** `stats/05_bundle_gen.json`

**Done when:** All bundles uploaded to S3, JSON is valid, icons render.

---

## Phase 5: Frontend (Stage 6)

This phase can begin in parallel with Phases 3-4 using mock bundle data.

### Step 5.1: Mock Data for Frontend Dev

Generate 2-3 small mock bundle files (`tabs/0.json`, `tabs/1.json`, `tabs/2.json`) with ~20 entries each. Use real favicons (Google, GitHub, Wikipedia, etc.) manually base64-encoded. This lets us develop the frontend without waiting for the pipeline.

Serve locally with any static file server (`python3 -m http.server`).

**Done when:** Mock bundles exist and can be served locally.

### Step 5.2: Basic Tab Rendering

Build `frontend/index.html` and `frontend/site.js`:

1. HTML: minimal shell with a container div, inline CSS for tab styling
2. JS: fetch a bundle, render tabs as rows filling the viewport
3. Tab appearance: mimic Firefox tab shape (rounded top corners, slight border)
4. Each tab shows favicon (16x16 or 32x32 img from data URI) + truncated title
5. No-icon tabs show title only

Focus: get the visual density right. How many tabs fit across? How many rows fill the viewport? This determines `ENTRIES_PER_BUNDLE`.

**Done when:** Page renders tabs from a mock bundle. Visually looks like a page full of browser tabs.

### Step 5.3: Marquee Animation

Add horizontal marquee to each row:
- CSS `@keyframes` animation, translateX
- Each row at slightly different speed and direction (some left, some right)
- Smooth, subtle movement — not distracting, just enough to feel alive
- Rows need extra tabs beyond viewport width to avoid gaps during scroll

**Done when:** Rows scroll smoothly, no visual glitches at edges.

### Step 5.4: Interaction — Click, Iframe, Close

Implement tab click behavior:
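1. If `iframe_ok`: show an overlay with an iframe loading the site (`{protocol}://{hostname}`; the entry fields listed in Step 4.4 do not include the protocol, so add it to the bundle or fold it into `host`)
2. If `!iframe_ok`: open in a new tab (`target="_blank"` with `rel="noopener"`)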
3. Visual indicator on tabs that will open externally (small icon/badge)
4. Close overlay: X button + click-outside + Escape key

**Done when:** Clicking tabs works correctly for both iframe and external cases.

### Step 5.5: Infinite Scroll + Random Bundle Loading

Implement:
1. PRNG seeded from `Date.now()`: the sequence of bundle indices differs per page load but is reproducible from the seed
2. On page load: fetch first bundle, render
3. Scroll detection: when user approaches bottom, fetch next random bundle
4. Track loaded bundle IDs in a Set (no duplicates)
5. Append new rows below existing ones
6. Handle edge case: all bundles loaded (unlikely at full scale, where ~30M hosts at ~120 entries per bundle is on the order of 250K bundles, but handle gracefully)

`TOTAL_BUNDLES` is a constant baked into the JS at build time.

**Done when:** Infinite scroll works, new bundles load seamlessly, no duplicate bundles.

### Step 5.6: Frontend Build Script

Write `pipeline/06_frontend/build.sh`:
1. Read total bundle count (from pipeline output or S3)
2. Inject `const TOTAL_BUNDLES = {M};` into site.js
3. Copy index.html + site.js to S3 `everytab-site/`
4. Invalidate CloudFront (if distribution exists)

**Done when:** Build script produces deployable frontend with correct bundle count.

---

## Phase 6: Integration & End-to-End Test (100K)

### Step 6.1: Run Full Pipeline (100K)

Execute all stages in sequence on EC2:
1. Verify hosts table has 100K entries (from Phase 1)
2. Run WARC parser (Phase 2) — should complete in minutes
3. Run icon downloader (Phase 3) — should complete in 10-30 minutes at 100K scale
4. Run best icon selection (Phase 4.1)
5. Run bundle generator (Phase 4.2-4.4)
6. Run frontend build (Phase 5.6)

**Validation:** Visit the CloudFront URL. The site should work:
- Tabs render with real favicons and titles
- Clicking works (iframe + external)
- Scrolling loads more tabs
- No JS console errors

### Step 6.2: Tune Parameters

Based on the 100K run:
- **ENTRIES_PER_BUNDLE:** Look at the live site. Does one bundle fill the screen? Too many tabs? Too few? Adjust.
- **Concurrency:** Was the icon download memory-stable? CPU-bound or network-bound? Adjust goroutine pool size.
- **Timeouts:** What was the error distribution? Are timeouts too aggressive? Too lenient?
- **Icon selection:** Do the selected icons look good? Any weird sizes or broken images?

Update CLI flag defaults based on findings.

### Step 6.3: Collect & Review Stats

Merge all `stats/*.json` into a single pipeline report. Review:
- Loss at each stage (domains → parsed → icons downloaded → icons selected → bundled)
- Time per stage
- Error patterns (are certain TLDs failing more? certain icon formats?)
- Storage usage (S3 icons bucket, S3 site bucket)

Identify any pipeline bugs or data quality issues. Fix before scaling up.

**Done when:** End-to-end works at 100K, parameters tuned, stats reviewed, bugs fixed.

---

## Phase 7: Full-Scale Run (30M)

### Step 7.1: Remove Limits, Re-run CC-Index Query

Update the DuckDB query to remove `LIMIT 100000`. Re-run.

Considerations:
- If httpfs takes >1hr, switch to downloading the parquet files first
- May need to increase RDS storage (30M rows with WARC paths ≈ 5-10GB)
- Monitor DuckDB memory usage

**Validation:** `SELECT COUNT(*) FROM hosts;` shows ~30M rows.

### Step 7.2: Run WARC Parser at Scale

Run with full concurrency against 30M hosts. Expected time: 2-6 hours.

Monitor:
- Throughput (hosts/sec)
- Error rate stability (should plateau, not climb)
- Postgres connection pool health
- Memory usage

### Step 7.3: Run Icon Downloader at Scale

This is the long pole — expected 12-48 hours.

Monitor continuously:
- icons/sec rate
- DNS cache hit rate (check Unbound stats: `unbound-control stats`)
- S3 upload rate
- Error rate by type
- Completion percentage

If too slow (projected >48hrs):
- Consider increasing concurrency (if memory allows)
- Consider spinning up fleet (add more EC2 instances running the same binary)
- Check if DNS is the bottleneck (Unbound stats)
- Check if S3 uploads are the bottleneck (batch or reduce HEAD checks)

### Step 7.4: Best Icon Selection + Bundle Generation

Run at full scale. Expected: 1-2 hours total.

Monitor bundle sizes — verify they're in the expected range with `ENTRIES_PER_BUNDLE` from tuning.

### Step 7.5: Rebuild Frontend + Deploy

Run frontend build with the real bundle count. Invalidate CloudFront.

**Validation:** Visit the live site. Browse around. Check:
- Tab variety (seeing diverse sites, not just one TLD)
- Icon quality (no broken images, reasonable sizes)
- Performance (bundles load quickly, no jank)
- Stats page / stats.json looks correct

**Done when:** Full-scale site is live and working.

---

## Phase 8: Backup & Teardown

### Step 8.1: Backup RDS to Homelab

```bash
# On EC2 (fast connection to RDS):
pg_dump -Fc "$DATABASE_URL" > everytab_dump.pgfc

# Transfer to homelab (from EC2 or direct):
scp everytab_dump.pgfc homelab:/backups/everytab/

# On homelab, verify restore (pg_restore needs the target database to exist):
createdb everytab_local
pg_restore --no-owner -d everytab_local everytab_dump.pgfc
psql everytab_local -c "SELECT COUNT(*) FROM hosts;" -c "SELECT COUNT(*) FROM icons;"
```

### Step 8.2: Backup Icons S3 to Homelab

```bash
# From homelab (or EC2 as intermediary):
aws s3 sync s3://everytab-icons/ /backups/everytab/icons/

# Verify file count matches (count files at any depth on both sides):
find /backups/everytab/icons/ -type f | wc -l
# Compare with: aws s3 ls s3://everytab-icons/ --recursive | wc -l
```

### Step 8.3: Verify & Teardown

After confirming backups:

```bash
# Verify the live site still works (it only depends on everytab-site + CloudFront)
curl -s https://your-cloudfront-domain.net/ | head

# Teardown scanning infrastructure:
aws rds delete-db-instance --db-instance-identifier everytab --skip-final-snapshot
aws s3 rb s3://everytab-icons --force
aws ec2 terminate-instances --instance-ids i-xxxxx
```

**Done when:** Only `everytab-site` S3 bucket + CloudFront remain running. Monthly cost: ~$2-4.

---

## Development Notes

### What Can Be Parallelized

- **Frontend dev (Phase 5.1-5.5)** can happen at any time using mock data
- **AWS infra setup (Phase 0.2)** can happen while writing code locally
- **Icon downloader (Phase 3)** and **bundle generator (Phase 4)** are independent codebases, can be written in parallel

### Testing Strategy

- **Dry-run flags** on all Go programs: print what would happen without mutating DB/S3
- **--limit flags** on all Go programs: process a small subset quickly
- **Spot-checks:** after each stage, manually verify 5-10 random entries
- **Stats files:** compare counts between stages to catch data loss
- **100K dev set:** full pipeline at small scale before committing to a 24hr+ full run

### Common Pitfalls to Watch For

- **DuckDB CC-Index path:** The exact S3 path to parquet files changes per crawl. Check Common Crawl's website for the latest crawl ID and index location.
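- **WARC record format:** Common Crawl stores each record as its own gzip member, so a byte-range slice must be gunzipped before parsing. The decompressed record has a specific envelope format (WARC/1.0 header, blank line, HTTP response). Don't assume the HTTP response starts at byte 0.
- **Relative icon URLs:** `/favicon.ico` is relative to root, but `favicon.ico` (no leading slash) is relative to the page path. Since we only have root pages (`/`), both resolve the same. `../icons/fav.png` also resolves against `/`: standard reference resolution (RFC 3986, as implemented by Go's `url.ResolveReference`) drops `..` segments above the root, yielding `/icons/fav.png`. A `<base href>` tag can change the base URL, so handle that case gracefully or skip it.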
- **ICO files are complex:** The ICO container format can embed BMP (with a modified header) or PNG. Many "ICO" files are actually just PNGs renamed to .ico. Check magic bytes, not file extension.
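- **SVG rasterization:** Go doesn't have great native SVG support. Consider shelling out to `rsvg-convert` (the CLI that ships with librsvg), or evaluate a pure-Go SVG rasterizer (confirm any candidate library exists and is maintained before depending on it). This can be a follow-up if SVG icons are rare.
- **Postgres connection limits:** RDS derives `max_connections` from instance memory (default `LEAST({DBInstanceClassMemory/9531392}, 5000)`), which works out to a few hundred on a db.t3.medium, not thousands; check the actual value with `SHOW max_connections;`. With 1000 goroutines, we need connection pooling (pgx pool handles this). Set pool max to ~40 connections.
- **S3 consistency and dedup races:** S3 is strongly read-after-write consistent, so a HEAD after a completed upload will find the object. Two workers can still HEAD the same new key at the same time and both upload; that is harmless because the key is the content hash, which makes the upload idempotent.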
- **CloudFront caching:** After deploying new bundles, invalidate `/*` or set short TTL during development. For production, use long TTLs (bundles are immutable between crawls).
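The relative icon URL pitfall is mostly handled by the standard library. The sketch below resolves an `href` against the host root with `net/url`, applying the Step 2.2 rules of skipping `data:` URIs and icons on other hosts; the function name is hypothetical, and it does not account for `<base href>` tags.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveIconURL turns a <link> href into an absolute URL on the same host.
// ok is false for empty hrefs, data: URIs, unparsable values, non-HTTP
// schemes, and icons hosted on a different hostname.
func resolveIconURL(protocol, hostname, href string) (abs string, ok bool) {
	href = strings.TrimSpace(href)
	if href == "" || strings.HasPrefix(strings.ToLower(href), "data:") {
		return "", false
	}
	base, err := url.Parse(fmt.Sprintf("%s://%s/", protocol, hostname))
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	// ResolveReference applies RFC 3986 rules, including dropping ".."
	// segments that would climb above the root.
	u := base.ResolveReference(ref)
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", false
	}
	if !strings.EqualFold(u.Hostname(), hostname) {
		return "", false
	}
	u.Fragment = ""
	return u.String(), true
}
```

With this, `/favicon.ico`, `favicon.ico` and `../favicon.ico` all resolve to the same URL for a root page, while protocol-relative links such as `//cdn.example.net/f.png` are rejected as cross-host.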