Joe Lothan e5035d9a28 updated PLAN.md, finished phase 5

2026-05-18 00:26:50 -04:00

EveryTab Implementation Plan

This plan builds the system described in ARCHITECTURE.md in incremental steps. We start with 100K hosts to validate the pipeline end-to-end, then scale to the full ~30M.

Each step has a clear deliverable and validation criteria. Steps are sequential — each phase builds on the previous.

Phase 0: Project Setup & AWS Infrastructure [COMPLETED]

Step 0.1: Repository Structure [COMPLETED]

everytab/
├── design.md
├── ARCHITECTURE.md
├── PLAN.md
├── infra/
│   ├── main.tf              # Terraform: all AWS resources
│   ├── terraform.tfvars.example
│   ├── ec2-userdata.sh      # EC2 bootstrap (Go, DuckDB, Unbound)
│   └── README.md            # Setup steps
├── pipeline/
│   ├── 01_cc_index/
│   │   └── schema.sql      # Postgres table definitions
│   ├── 02_warc_parse/
│   ├── 03_icon_download/
│   ├── 04_best_icon/
│   ├── 05_bundle_gen/
│   └── 06_frontend/
├── frontend/
├── stats/                   # gitignored
└── go.mod

Step 0.2: AWS Infrastructure (Terraform) [COMPLETED]

Infrastructure managed via infra/main.tf. Single file, uses var.scanning bool to switch phases:

terraform apply — creates all scanning resources (EC2, RDS, S3 icons, S3 site, IAM, security groups)
terraform apply -var="scanning=false" — destroys scanning resources, keeps site bucket
terraform destroy — removes everything

Resources created:

S3 everytab-icons (private), S3 everytab-site (for CloudFront later)
RDS Postgres 16, db.t3.medium, 20GB gp3
EC2 c5.xlarge, Amazon Linux 2023, 50GB gp3
Security groups (SSH from home IP, RDS from EC2 only)
IAM role + instance profile (S3 access only)
SSH key (Terraform-managed ed25519)

Step 0.3: EC2 Environment Setup [COMPLETED]

Bootstrap via infra/ec2-userdata.sh:

Go 1.22+, DuckDB (httpfs + postgres extensions), Unbound (recursive resolver), psql, tmux
Unbound configured as system resolver (systemd-resolved disabled)
DATABASE_URL in .bashrc
Schema applied: hosts + icons tables with indexes

Phase 1: CC-Index Query (Stage 1)

Step 1.1: Database Schema

Create the Postgres tables. Run via psql:

CREATE TABLE hosts (
    id SERIAL PRIMARY KEY,
    hostname TEXT NOT NULL UNIQUE,
    protocol TEXT NOT NULL,
    crawl_id TEXT NOT NULL,
    warc_filename TEXT NOT NULL,
    warc_record_offset BIGINT NOT NULL,
    warc_record_length INT NOT NULL,
    html_title TEXT,
    iframe_allowed BOOLEAN,
    best_icon_s3_key TEXT,
    parsed BOOLEAN DEFAULT FALSE
);

CREATE TABLE icons (
    id SERIAL PRIMARY KEY,
    host_id INT NOT NULL REFERENCES hosts(id),
    url TEXT NOT NULL,
    source TEXT NOT NULL,
    rel_type TEXT,
    rel_sizes TEXT,
    content_type TEXT,
    width INT,
    height INT,
    file_size INT,
    s3_key TEXT,
    scan_state TEXT DEFAULT 'unscanned',
    error TEXT
);

CREATE INDEX idx_hosts_parsed ON hosts(id) WHERE parsed = FALSE;
CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned';
CREATE INDEX idx_icons_host_id ON icons(host_id);

Done when: Tables exist in RDS, schema matches ARCHITECTURE.md.

Step 1.2: DuckDB CC-Index Query (100K limit) [COMPLETED]

Script: pipeline/01_cc_index/query.sh

Uses DuckDB with the aws extension (credential chain) to read Parquet directly from the s3://commoncrawl/.../*.parquet glob, and the postgres extension to write results into RDS. Auto-detects the latest crawl ID from the CC API.

Deduplication via GROUP BY url_host_name with first(... ORDER BY ...) aggregates (hash aggregation — more memory-efficient than window functions).
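
To illustrate why hash aggregation stays cheap, here is a minimal Go sketch of the same "one row per host" idea done in memory. It is not the DuckDB query itself: the record type is a hypothetical, trimmed-down index row, and preferHTTPS is an assumed ordering used only for the demo, since the plan does not spell out the ORDER BY.

```go
package main

import "fmt"

// record is a hypothetical, trimmed-down CC-index row.
type record struct {
	Host     string
	Protocol string
	Offset   int64
}

// dedupeByHost keeps exactly one record per host in a single pass.
// The map plays the role of the hash table in a hash aggregation:
// memory grows with the number of distinct hosts, not the number of rows.
func dedupeByHost(rows []record, better func(a, b record) bool) map[string]record {
	best := make(map[string]record)
	for _, r := range rows {
		cur, seen := best[r.Host]
		if !seen || better(r, cur) {
			best[r.Host] = r
		}
	}
	return best
}

// preferHTTPS is an assumed ordering for illustration only.
func preferHTTPS(a, b record) bool {
	return a.Protocol == "https" && b.Protocol != "https"
}

func main() {
	rows := []record{
		{"example.com", "http", 100},
		{"example.com", "https", 200},
		{"example.org", "http", 300},
	}
	for host, r := range dedupeByHost(rows, preferHTTPS) {
		fmt.Println(host, r.Protocol, r.Offset)
	}
}
```

A window function would have to materialize and rank every row of each partition, whereas this pass only ever holds the current winner per host.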

Result: 100K hosts, 77% https / 23% http, completed in 692s.

Done when: 100K hosts in the database with valid WARC coordinates.

Step 1.3: Validate WARC Coordinates [COMPLETED]

Manually fetched WARC records with curl byte-range requests to data.commoncrawl.org. Confirmed valid WARC headers, HTTP response, and HTML with <title> and <link rel="icon"> tags.
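
The same byte-range fetch can be sketched in Go. This is a minimal sketch, assuming the WARC file is served under https://data.commoncrawl.org/ followed by warc_filename; the function name warcRequest and the placeholder coordinates are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// warcRequest builds a byte-range GET for a single WARC record.
// HTTP ranges are inclusive on both ends, so the last byte is offset+length-1.
func warcRequest(filename string, offset, length int64) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, "https://data.commoncrawl.org/"+filename, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+length-1))
	return req, nil
}

func main() {
	// Placeholder coordinates; real values come from warc_filename,
	// warc_record_offset and warc_record_length in the hosts table.
	req, err := warcRequest("path/to/file.warc.gz", 1000, 500)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL, req.Header.Get("Range"))
}
```

Common Crawl stores each record as its own gzip member, so the bytes returned for one range can be decompressed on their own.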

Phase 2: WARC Parsing (Stage 2) [COMPLETED]

Steps 2.1-2.3 [COMPLETED]

Binary: pipeline/02_warc_parse/ (6 files: main.go, warc.go, parser.go, process.go, db.go, log.go)

Architecture:

Fetches WARC records via AWS SDK S3 byte-range GetObject (using EC2 instance profile credentials)
Parses WARC records with github.com/nlnwa/gowarc/v3
Parses HTML with golang.org/x/net/html tokenizer (lenient, stops at <body>)
Detects charset via golang.org/x/net/html/charset and converts to UTF-8
Sanitizes titles with strings.ToValidUTF8 as final safety net
Concurrent goroutine pool with configurable concurrency
Per-host log lines to stdout + optional log file
Panic recovery per goroutine (logs PANIC, doesn't mark row as parsed)
DB errors tracked and logged with DB_ERROR: prefix
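
The goroutine pool with per-item panic recovery could look roughly like the sketch below. It uses only the standard library; runPool, safeProcess, and demoProcess are hypothetical names, and demoProcess stands in for the real fetch-and-parse step.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// errPanic marks a work item whose handler panicked.
var errPanic = errors.New("panic")

// counters holds per-run totals, updated atomically by the workers.
type counters struct {
	OK, Failed, Panics int64
}

// safeProcess runs fn and turns a panic into an error, so one malformed
// record cannot take the whole worker down.
func safeProcess(fn func(string) error, item string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%w: %v", errPanic, r)
		}
	}()
	return fn(item)
}

// runPool processes items with a fixed number of worker goroutines.
// concurrency must be at least 1.
func runPool(items []string, concurrency int, fn func(string) error) counters {
	var c counters
	work := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range work {
				err := safeProcess(fn, item)
				switch {
				case err == nil:
					atomic.AddInt64(&c.OK, 1)
				case errors.Is(err, errPanic):
					// Counted separately; the real tool also leaves the row unparsed.
					atomic.AddInt64(&c.Panics, 1)
				default:
					atomic.AddInt64(&c.Failed, 1)
				}
			}
		}()
	}
	for _, item := range items {
		work <- item
	}
	close(work)
	wg.Wait()
	return c
}

// demoProcess is a stand-in for fetch + parse: it fails on "bad" and panics on "boom".
func demoProcess(host string) error {
	switch host {
	case "bad":
		return errors.New("fetch failed")
	case "boom":
		panic("malformed record")
	}
	return nil
}

func main() {
	c := runPool([]string{"a.com", "bad", "boom", "b.org"}, 3, demoProcess)
	fmt.Printf("ok=%d failed=%d panics=%d\n", c.OK, c.Failed, c.Panics)
}
```

Recovering inside safeProcess rather than at the top of the worker keeps the goroutine alive, so a panic costs one item instead of one worker.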

CLI: ./warc_parse --db URL [--concurrency N] [--batch-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]

Result (100K hosts, concurrency 500):

Duration: 5m31s (~300 hosts/sec)
Titles found: 93,384 (93%)
Icons found: 201,780 (~2 per host)
Iframe blocked: 17,855 (18%)
Fetch errors: 3
DB errors: 0
Panics: 0

Phase 3: Icon Download (Stage 3) [COMPLETED]

Steps 3.1-3.3 [COMPLETED]

Binary: pipeline/03_icon_download/ (6 files: main.go, download.go, image.go, s3.go, db.go, log.go)

Architecture:

Channel-based work distribution: producer goroutine claims batches, N worker goroutines consume from buffered channel (no worker starvation)
Shared http.Transport for connection pooling / TLS session reuse
Content-addressed S3 storage (SHA-256 hash as key, dedup via HeadObject before upload)
Magic byte validation (PNG, GIF, JPEG, ICO, BMP, WebP, SVG)
ICO directory parsing for dimensions (picks largest ≤64x64)
Filters to eligible icons only: favicon_ico + link_rel with no declared size or ≤64x64
md5(id) shuffle in claim query to spread requests across hosts
Panic recovery per worker, DB errors tracked and logged
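
Magic byte validation and content addressing can be sketched as follows. This is an illustration under assumptions, not the contents of image.go or s3.go: the exact signatures checked, the SVG heuristic, and the bare hex key format (no prefix or extension) are all assumed.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sniffImage returns a short format name based on the leading bytes,
// or "" when the payload does not look like a supported image.
func sniffImage(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")):
		return "png"
	case bytes.HasPrefix(b, []byte("GIF87a")), bytes.HasPrefix(b, []byte("GIF89a")):
		return "gif"
	case bytes.HasPrefix(b, []byte{0xFF, 0xD8, 0xFF}):
		return "jpeg"
	case bytes.HasPrefix(b, []byte{0x00, 0x00, 0x01, 0x00}):
		return "ico"
	case bytes.HasPrefix(b, []byte("BM")):
		return "bmp" // weak two-byte signature; a real check would also read the header
	case len(b) >= 12 && string(b[0:4]) == "RIFF" && string(b[8:12]) == "WEBP":
		return "webp"
	}
	// SVG is text, so there is no fixed signature; this is a rough heuristic.
	head := b
	if len(head) > 512 {
		head = head[:512]
	}
	if bytes.Contains(bytes.ToLower(head), []byte("<svg")) {
		return "svg"
	}
	return ""
}

// contentKey derives the storage key from the payload itself, so identical
// icons served by different hosts map to the same object.
func contentKey(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	payload := []byte("\x89PNG\r\n\x1a\n...rest of file...")
	fmt.Println(sniffImage(payload), contentKey(payload))
}
```

Because the key depends only on the bytes, a HeadObject on that key is enough to tell whether an upload can be skipped, which is what makes the shared platform favicons show up as dedup hits.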

CLI: ./icon_download --db URL [--s3-bucket NAME] [--concurrency N] [--batch-size N] [--timeout D] [--max-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]

Result (100K hosts, ~224K eligible icons):

Duration: 10m36s (351 icons/sec)
Completed: 156,214 (70%)
Failed: 67,459 (30% — mostly HTTP 404s from stale crawl data)
Dedup hits: 55,771 (25% — shared Wix/WordPress/hosted platform favicons)
Downloaded: 1.9GB
DNS errors: 1,668 | Timeouts: 2,129 | HTTP errors: 47,565 | Invalid: 11,803 | Too large: 777
DB errors: 0 | Panics: 0

Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5) [COMPLETED]

Step 4.1: Best Icon Selection SQL [COMPLETED]

Script: pipeline/04_best_icon/select.sql

Selects the best icon per host using DISTINCT ON with priority ordering. Excludes SVGs (can't rasterize) and ≤2x2 icons (tracking pixels). See ARCHITECTURE.md for the full decision flow.

Result: 70,366 hosts got an icon (72%), 23,018 have title but no icon.

Steps 4.2-4.4: Bundle Generator [COMPLETED]

Binary: pipeline/05_bundle_gen/ (6 files: main.go, bundle.go, convert.go, db.go, s3.go, log.go)

Architecture:

Queries all hosts with titles (randomized), concurrently downloads best icon from S3 icons bucket
Uses github.com/biessek/golang-ico for ICO decoding (handles all bit depths including palette-based 1/4/8bpp)
image.Decode handles PNG/GIF/JPEG/WebP/BMP/ICO via registered decoders. SVGs excluded.
Icons >128px downscaled to 32x32 (nearest-neighbor). Icons ≤128px kept as-is.
Re-encodes all icons as PNG, base64-encoded inline in bundle JSON.
Panic recovery per icon conversion (malformed ICO files can panic inside the ICO library)
Concurrent S3 downloads with configurable concurrency (default 50)
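
The resize-and-encode rule can be sketched with the standard library alone. This is a sketch under assumptions: it treats "larger than 128px" as either side exceeding 128, and the helper names (downscaleNearest, toBundlePNG, solid, decodedSize) are hypothetical rather than taken from convert.go.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"image"
	"image/color"
	"image/png"
)

const (
	maxKeep = 128 // icons with a side larger than this are downscaled (assumed: either side)
	target  = 32  // edge length of downscaled icons
)

// downscaleNearest resamples src to w x h by copying, for each destination
// pixel, the source pixel at the proportional coordinate.
func downscaleNearest(src image.Image, w, h int) *image.RGBA {
	sb := src.Bounds()
	dst := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		sy := sb.Min.Y + y*sb.Dy()/h
		for x := 0; x < w; x++ {
			sx := sb.Min.X + x*sb.Dx()/w
			dst.Set(x, y, src.At(sx, sy))
		}
	}
	return dst
}

// toBundlePNG applies the size rule, re-encodes as PNG, and returns base64.
func toBundlePNG(src image.Image) (string, error) {
	b := src.Bounds()
	if b.Dx() > maxKeep || b.Dy() > maxKeep {
		src = downscaleNearest(src, target, target)
	}
	var buf bytes.Buffer
	if err := png.Encode(&buf, src); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// solid returns a w x h opaque red image, used here as demo input.
func solid(w, h int) *image.RGBA {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			img.Set(x, y, color.RGBA{R: 255, A: 255})
		}
	}
	return img
}

// decodedSize decodes a base64 PNG and reports its dimensions.
func decodedSize(b64 string) (int, int, error) {
	raw, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return 0, 0, err
	}
	cfg, err := png.DecodeConfig(bytes.NewReader(raw))
	if err != nil {
		return 0, 0, err
	}
	return cfg.Width, cfg.Height, nil
}

func main() {
	for _, size := range []int{16, 128, 256} {
		enc, err := toBundlePNG(solid(size, size))
		if err != nil {
			panic(err)
		}
		w, h, err := decodedSize(enc)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%dx%d -> %dx%d (%d base64 bytes)\n", size, size, w, h, len(enc))
	}
}
```

Nearest-neighbor is the cheapest option and avoids an extra dependency; a smoother filter would look better on photographic icons at the cost of more CPU per icon.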

CLI: ./bundle_gen --db URL [--icons-bucket NAME] [--site-bucket NAME] [--entries-per-bundle N] [--concurrency N] [--limit N] [--dry-run] [--output-dir DIR] [--log-file PATH] [--log-errors-only]

Result (93K hosts with titles, 70K with icons):

Duration: 1m30s
Bundles created: 779 (120 entries each, last bundle partial)
Total size: 165MB (avg 216KB per bundle)
Convert errors: 1,263 (1,077 SVGs + 186 other — panics, truncated files, corrupt GIFs)
S3: 779 JSON files in everytab-site/tabs/
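
As a sanity check on the bundle count, the arithmetic is a ceiling division. The sketch below assumes the 93,384 hosts with titles from Phase 2 are exactly the entries that were bundled; bundleCount and lastBundleSize are hypothetical helper names.

```go
package main

import "fmt"

// bundleCount returns how many bundles are needed to hold all entries,
// i.e. ceil(entries / perBundle) using integer arithmetic.
func bundleCount(entries, perBundle int) int {
	if entries <= 0 || perBundle <= 0 {
		return 0
	}
	return (entries + perBundle - 1) / perBundle
}

// lastBundleSize returns how many entries land in the final bundle.
func lastBundleSize(entries, perBundle int) int {
	n := bundleCount(entries, perBundle)
	if n == 0 {
		return 0
	}
	return entries - (n-1)*perBundle
}

func main() {
	const entries, perBundle = 93384, 120
	fmt.Println("bundles:", bundleCount(entries, perBundle))
	fmt.Println("entries in last bundle:", lastBundleSize(entries, perBundle))
}
```

With those inputs the count comes out at 779, matching the reported result, with 24 entries in the final partial bundle.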

Phase 5: Frontend (Stage 6) [COMPLETED — v1]

Steps 5.1-5.6 [COMPLETED]

Files: frontend/index.html and frontend/site.js

Architecture:

Vanilla JS, no framework. Two files: HTML (with inline CSS) + JS.
Fetches random bundle JSONs from tabs/{N}.json, renders tabs as rows filling the viewport.
Seeded PRNG (Date.now() + mulberry32), so each visit gets a different tab arrangement.
Infinite scroll: loads more bundles as user approaches the bottom.
Tracks loaded bundle IDs in a Set to avoid duplicates.

Tab rendering:

Browser-specific tab styling via navigator.userAgent detection (Chrome, Firefox, Safari).
Inactive tab appearance by default, selected/active style when iframe is open.
Light mode default, auto-switches to dark mode via prefers-color-scheme.
Bidirectional marquee: each row randomly scrolls left or right at different speeds (90-150s per cycle).
Tabs duplicated in DOM for seamless marquee loop (translateX(-50%)).
Hover shows full title as native tooltip.
External link indicator (↗) on tabs that don't allow iframes.

Iframe viewer:

Inline, not overlay — opens between tab rows, pushes content down (75vh height).
Header shows favicon, title, external link, and close button.
Sandboxed iframe (allow-scripts allow-same-origin allow-forms).
Close via X button, Escape key.
Only one viewer open at a time.

TOTAL_BUNDLES is meant to be baked into the HTML at build time. The build script (pipeline/06_frontend/build.sh) is still TODO, so the value is currently hardcoded.

Phase 6: Integration & End-to-End Test (100K)

Step 6.1: Run Full Pipeline (100K)

Execute all stages in sequence on EC2:

Verify hosts table has 100K entries (from Phase 1)
Run WARC parser (Phase 2) — should complete in minutes
Run icon downloader (Phase 3) — should complete in 10-30 minutes at 100K scale
Run best icon selection (Phase 4.1)
Run bundle generator (Phase 4.2-4.4)
Run frontend build (Phase 5.6)

Validation: Visit the CloudFront URL. The site should work:

Tabs render with real favicons and titles
Clicking works (iframe + external)
Scrolling loads more tabs
No JS console errors

Step 6.2: Tune Parameters

Based on the 100K run:

ENTRIES_PER_BUNDLE: Look at the live site. Does one bundle fill the screen? Too many tabs? Too few? Adjust.
Concurrency: Was the icon download memory-stable? CPU-bound or network-bound? Adjust goroutine pool size.
Timeouts: What was the error distribution? Are timeouts too aggressive? Too lenient?
Icon selection: Do the selected icons look good? Any weird sizes or broken images?

Update CLI flag defaults based on findings.

Step 6.3: Collect & Review Stats

Merge all stats/*.json into a single pipeline report. Review:

Loss at each stage (domains → parsed → icons downloaded → icons selected → bundled)
Time per stage
Error patterns (are certain TLDs failing more? certain icon formats?)
Storage usage (S3 icons bucket, S3 site bucket)
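
The stage-to-stage loss can be computed mechanically once the counts are collected. The sketch below is a hypothetical helper, not part of the pipeline; the demo uses host-level counts reported earlier in this plan and treats the "100K hosts" figure as exactly 100,000, which is an assumption.

```go
package main

import "fmt"

// stage is one step of the funnel with the number of items that survived it.
type stage struct {
	Name  string
	Count int
}

// retention returns, for each stage after the first, the percentage of the
// previous stage's items that made it through.
func retention(stages []stage) []float64 {
	out := make([]float64, 0, len(stages))
	for i := 1; i < len(stages); i++ {
		prev := stages[i-1].Count
		if prev == 0 {
			out = append(out, 0)
			continue
		}
		out = append(out, 100*float64(stages[i].Count)/float64(prev))
	}
	return out
}

func main() {
	// Counts from the 100K results in Phases 1, 2 and 4.
	stages := []stage{
		{"hosts", 100000},
		{"titles found", 93384},
		{"best icon selected", 70366},
	}
	for i, pct := range retention(stages) {
		fmt.Printf("%s -> %s: %.1f%% kept\n", stages[i].Name, stages[i+1].Name, pct)
	}
}
```

Reporting each step relative to the previous one, rather than to the initial 100K, makes it easier to see which stage is responsible for a drop.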

Identify any pipeline bugs or data quality issues. Fix before scaling up.

Done when: End-to-end works at 100K, parameters tuned, stats reviewed, bugs fixed.

Phase 7: Full-Scale Run (30M)

Step 7.1: Remove Limits, Re-run CC-Index Query

Update the DuckDB query to remove LIMIT 100000. Re-run.

Considerations:

If httpfs takes >1hr, switch to downloading the parquet files first
May need to increase RDS storage (30M rows with WARC paths ≈ 5-10GB)
Monitor DuckDB memory usage

Validation: SELECT COUNT(*) FROM hosts; shows ~30M rows.

Step 7.2: Run WARC Parser at Scale

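Run with full concurrency against 30M hosts. Expected time: 2-6 hours. At the ~300 hosts/sec measured in Phase 2, 30M hosts would take roughly 28 hours, so this estimate assumes throughput well above what the 100K run achieved.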

Monitor:

Throughput (hosts/sec)
Error rate stability (should plateau, not climb)
Postgres connection pool health
Memory usage

Step 7.3: Run Icon Downloader at Scale

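This is the long pole: expected 12-48 hours. At the 351 icons/sec measured in Phase 3, and assuming the same ~2.2 eligible icons per host, 30M hosts (~67M icons) would take roughly 53 hours on a single instance.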

Monitor continuously:

icons/sec rate
DNS cache hit rate (check Unbound stats: unbound-control stats)
S3 upload rate
Error rate by type
Completion percentage

If too slow (projected >48hrs):

Consider increasing concurrency (if memory allows)
Consider spinning up a fleet (add more EC2 instances running the same binary)
Check if DNS is the bottleneck (Unbound stats)
Check if S3 uploads are the bottleneck (batch or reduce HEAD checks)
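
The "projected >48hrs" check is simple arithmetic on the observed rate. The sketch below is a hypothetical helper; the 67.2M icon total is an assumption (30M hosts at the same ~2.24 eligible icons per host as the 100K run), and the instance estimate assumes throughput scales linearly, which DNS, RDS, or remote servers may not allow.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// projected estimates how long the remaining items take at the observed rate.
func projected(remaining int, ratePerSec float64) (time.Duration, error) {
	if ratePerSec <= 0 {
		return 0, errors.New("rate must be positive")
	}
	return time.Duration(float64(remaining) / ratePerSec * float64(time.Second)), nil
}

// instancesNeeded returns how many identical instances would be needed to
// finish within budget, assuming linear scaling. It returns 0 for
// non-positive inputs.
func instancesNeeded(single, budget time.Duration) int {
	if single <= 0 || budget <= 0 {
		return 0
	}
	return int((single + budget - 1) / budget) // ceiling division
}

func main() {
	const icons = 67_200_000 // assumed total, see the note above
	d, err := projected(icons, 351) // 351 icons/sec measured in Phase 3
	if err != nil {
		panic(err)
	}
	fmt.Println("projected single-instance duration:", d.Round(time.Minute))
	fmt.Println("instances for a 48h budget:", instancesNeeded(d, 48*time.Hour))
}
```

Re-running the projection with the live icons/sec figure from the progress bar gives an early signal on whether to add instances before most of the budget is spent.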

Step 7.4: Best Icon Selection + Bundle Generation

Run at full scale. Expected: 1-2 hours total.

Monitor bundle sizes — verify they're in the expected range with ENTRIES_PER_BUNDLE from tuning.

Step 7.5: Rebuild Frontend + Deploy

Run frontend build with the real bundle count. Invalidate CloudFront.

Validation: Visit the live site. Browse around. Check:

Tab variety (seeing diverse sites, not just one TLD)
Icon quality (no broken images, reasonable sizes)
Performance (bundles load quickly, no jank)
Stats page / stats.json looks correct

Done when: Full-scale site is live and working.

Phase 8: Backup & Teardown

Step 8.1: Backup RDS to Homelab

# On EC2 (fast connection to RDS):
pg_dump -Fc "$DATABASE_URL" > everytab_dump.pgfc

# Transfer to homelab (from EC2 or direct):
scp everytab_dump.pgfc homelab:/backups/everytab/

# On homelab, verify restore (the target database must exist first):
createdb everytab_local
pg_restore -d everytab_local everytab_dump.pgfc
psql everytab_local -c "SELECT COUNT(*) FROM hosts;" -c "SELECT COUNT(*) FROM icons;"

Step 8.2: Backup Icons S3 to Homelab

# From homelab (or EC2 as intermediary):
aws s3 sync s3://everytab-icons/ /backups/everytab/icons/

# Verify file count matches (counted recursively, in case keys contain prefixes):
find /backups/everytab/icons/ -type f | wc -l
# Compare with: aws s3 ls s3://everytab-icons/ --recursive | wc -l

Step 8.3: Verify & Teardown

After confirming backups:

# Verify the live site still works (it only depends on everytab-site + CloudFront)
curl -s https://your-cloudfront-domain.net/ | head

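# Teardown scanning infrastructure.
# These resources are Terraform-managed (Step 0.2), so terraform apply -var="scanning=false" is the alternative that keeps state in sync.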
aws rds delete-db-instance --db-instance-identifier everytab --skip-final-snapshot
aws s3 rb s3://everytab-icons --force
aws ec2 terminate-instances --instance-ids i-xxxxx

Done when: Only everytab-site S3 bucket + CloudFront remain running. Monthly cost: ~$2-4.

Development Notes

Execution Order

Phases are sequential: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8. Frontend (Phase 5) uses real data from the 100K pipeline run. The only work that can happen ahead of time is writing the Go code locally before EC2 is ready (compile-test locally, then run on EC2).

Progress & Observability

All Go programs have two output modes running simultaneously:

Per-item log lines (stdout, above the progress bar):

WARC parser: parsed: example.com 200 "Example Domai..." ok or parsed: broken.net 200 "" err:no_title
Icon downloader: icon: https://example.com/favicon.ico 32x32 png 4.2KB ok or icon: https://fail.org/favicon.ico err:timeout
Bundle generator: bundle: 0042.json 120 entries 247KB ok

Each line is a short, fixed-format summary — hostname/URL, key result, and status. Keeps it scannable when running live.

Log file (--log-file path/to/out.log): If provided, all per-item log lines are mirrored to disk. For full-scale runs, consider the --log-errors-only flag, which writes only error lines to the log file (avoids filling the disk with 30M success lines). Without --log-file, logs go only to stdout.
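
The mirroring rule can be sketched as a small logger. This is a hypothetical shape, not the contents of log.go: the itemLogger type, the explicit isErr argument, and the replay helper (which swaps in in-memory buffers for demonstration) are all assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"sync"
)

// itemLogger writes every line to out and mirrors it to file when one is set.
type itemLogger struct {
	mu         sync.Mutex
	out        io.Writer
	file       io.Writer // nil when --log-file is not given
	errorsOnly bool      // --log-errors-only: mirror only error lines
}

// log emits one per-item line; workers may call it concurrently.
func (l *itemLogger) log(line string, isErr bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	fmt.Fprintln(l.out, line)
	if l.file != nil && (isErr || !l.errorsOnly) {
		fmt.Fprintln(l.file, line)
	}
}

// entry is one log line plus whether it reports an error.
type entry struct {
	Line  string
	IsErr bool
}

// replay feeds entries through a logger backed by in-memory buffers and
// returns what reached stdout and what reached the log file.
func replay(entries []entry, errorsOnly bool) (stdout, file string) {
	var outBuf, fileBuf bytes.Buffer
	l := &itemLogger{out: &outBuf, file: &fileBuf, errorsOnly: errorsOnly}
	for _, e := range entries {
		l.log(e.Line, e.IsErr)
	}
	return outBuf.String(), fileBuf.String()
}

func main() {
	// Without --log-file: stdout only.
	l := &itemLogger{out: os.Stdout}
	l.log(`parsed: example.com 200 "Example Domai..." ok`, false)
	l.log(`parsed: broken.net 200 "" err:no_title`, true)

	// With --log-file and --log-errors-only: only the error line is mirrored.
	_, file := replay([]entry{
		{"icon: https://example.com/favicon.ico 32x32 png 4.2KB ok", false},
		{"icon: https://fail.org/favicon.ico err:timeout", true},
	}, true)
	fmt.Print("file contents: ", file)
}
```

Passing the error status explicitly avoids guessing it from the line text, since a page title could itself contain a string like "err:".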

Progress bar (bottom of terminal, schollz/progressbar):

Items processed / total items
Processing rate (items/sec)
ETA
Error count

On completion, each program prints a summary line and writes its stats JSON (with started_at, finished_at, duration_seconds, and stage-specific counters).
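
One possible shape for the stats JSON is sketched below. Only started_at, finished_at, and duration_seconds come from this plan; the stage and counters fields, the counter names, and the method names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// runStats is written once per stage run.
type runStats struct {
	Stage           string           `json:"stage"` // hypothetical field
	StartedAt       time.Time        `json:"started_at"`
	FinishedAt      time.Time        `json:"finished_at"`
	DurationSeconds float64          `json:"duration_seconds"`
	Counters        map[string]int64 `json:"counters"` // hypothetical container for stage-specific counters
}

// finalize derives the duration from the two timestamps so they cannot disagree.
func (s *runStats) finalize() {
	s.DurationSeconds = s.FinishedAt.Sub(s.StartedAt).Seconds()
}

// encode renders the stats as indented JSON.
func (s *runStats) encode() ([]byte, error) {
	return json.MarshalIndent(s, "", "  ")
}

func main() {
	start := time.Date(2026, 5, 17, 12, 0, 0, 0, time.UTC) // placeholder timestamp
	s := runStats{
		Stage:      "warc_parse",
		StartedAt:  start,
		FinishedAt: start.Add(5*time.Minute + 31*time.Second),
		// Values from the Phase 2 result; the key names are made up.
		Counters: map[string]int64{"titles_found": 93384, "icons_found": 201780, "fetch_errors": 3},
	}
	s.finalize()
	b, err := s.encode()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Keeping the common fields identical across stages is what lets a later step merge stats/*.json into one report without per-stage parsing.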

Testing Strategy

Dry-run flags on all Go programs: print what would happen without mutating DB/S3
--limit flags on all Go programs: process a small subset quickly
Spot-checks: after each stage, manually verify 5-10 random entries
Stats files: compare counts between stages to catch data loss
100K dev set: full pipeline at small scale before committing to a 24hr+ full run