added PLAN.md with initial dev plan
parent
a327fb3db3
commit
c50be97fd7
1 changed files with 687 additions and 0 deletions
687
PLAN.md
Normal file
# EveryTab Implementation Plan

This plan builds the system described in ARCHITECTURE.md in incremental steps. We start with 100K hosts to validate the pipeline end-to-end, then scale to the full ~30M.

Each step has a clear deliverable and validation criteria. Steps within a phase are sequential; some phases can overlap (noted where applicable).

---

## Phase 0: Project Setup & AWS Infrastructure

### Step 0.1: Repository Structure

Create the project layout:

```
everytab/
├── design.md
├── ARCHITECTURE.md
├── PLAN.md
├── infra/                   # AWS CLI scripts for setup/teardown
│   ├── setup.sh             # Create RDS, S3 buckets, security groups
│   ├── teardown.sh          # Delete non-permanent resources
│   └── ec2-userdata.sh      # EC2 bootstrap (install Go, DuckDB, Unbound)
├── pipeline/
│   ├── 01_cc_index/         # DuckDB query scripts
│   ├── 02_warc_parse/       # Go program
│   ├── 03_icon_download/    # Go program
│   ├── 04_best_icon/        # SQL script
│   ├── 05_bundle_gen/       # Go program
│   └── 06_frontend/         # Build script, templates
├── frontend/
│   ├── index.html
│   └── site.js
├── stats/                   # Stats output from each stage (gitignored)
└── go.mod                   # Shared Go module for pipeline programs
```

**Done when:** Repo structure exists, `go.mod` initialized, `.gitignore` covers stats/ and any local config.

### Step 0.2: AWS Infrastructure (Manual CLI)

Create resources using AWS CLI commands in `infra/setup.sh`:

1. **S3 buckets:**
   - `everytab-icons` (private, no public access)
   - `everytab-site` (private, accessed via CloudFront OAC)

2. **RDS Postgres:**
   - `db.t3.medium`, 20GB storage (expandable), Postgres 16
   - In a VPC, security group allows inbound 5432 from EC2 security group
   - No public access (EC2 connects within VPC)
   - No multi-AZ (dev, not production)
   - Set a strong password, store in a local `.env` (gitignored)

3. **EC2 instance:**
   - `c5.xlarge` (4 vCPU, 8GB RAM), enough for Go concurrency + Unbound cache
   - Amazon Linux 2023 or Ubuntu 24.04
   - Security group: allow SSH (from your IP), allow outbound all
   - Same VPC/subnet as RDS
   - Key pair for SSH access

4. **CloudFront distribution:**
   - Origin: `everytab-site` S3 bucket (OAC)
   - Default cache behavior: cache everything, Brotli+Gzip compression
   - Can set up now or defer to Phase 2

5. **IAM role for EC2:**
   - S3 read/write to both buckets
   - Attach as instance profile

**Validation:** SSH into EC2, confirm `psql` can connect to RDS, confirm `aws s3 ls` shows both buckets.

**Done when:** All resources exist, EC2 can reach RDS and S3.

### Step 0.3: EC2 Environment Setup

Bootstrap script (`infra/ec2-userdata.sh` or run manually):

1. Install Go (latest stable, 1.22+)
2. Install DuckDB CLI
3. Install Unbound, configure as recursive resolver:
   - `/etc/unbound/unbound.conf`: recursive mode, no forwarding, listen on 127.0.0.1
   - High cache: `msg-cache-size: 512m`, `rrset-cache-size: 1g`
   - `cache-min-ttl: 3600`
   - `prefetch: yes`
   - `num-threads: 4`
4. Set `/etc/resolv.conf` → `nameserver 127.0.0.1`
5. Install `psql` client, `pg_dump`
6. Confirm DuckDB httpfs extension works: `INSTALL httpfs; LOAD httpfs;`

**Validation:**
- `go version` works
- `duckdb -c "INSTALL httpfs; LOAD httpfs; SELECT 1;"` works
- `dig example.com @127.0.0.1` resolves (Unbound working)
- `psql $DATABASE_URL -c "SELECT 1;"` connects to RDS

**Done when:** EC2 is a working development environment for all pipeline stages.

---

## Phase 1: CC-Index Query (Stage 1)

### Step 1.1: Database Schema

Create the Postgres tables. Run via `psql`:

```sql
CREATE TABLE hosts (
    id                  SERIAL PRIMARY KEY,
    hostname            TEXT NOT NULL UNIQUE,
    protocol            TEXT NOT NULL,
    crawl_id            TEXT NOT NULL,
    warc_filename       TEXT NOT NULL,
    warc_record_offset  BIGINT NOT NULL,
    warc_record_length  INT NOT NULL,
    html_title          TEXT,
    iframe_allowed      BOOLEAN,
    best_icon_s3_key    TEXT,
    parsed              BOOLEAN DEFAULT FALSE
);

CREATE TABLE icons (
    id            SERIAL PRIMARY KEY,
    host_id       INT NOT NULL REFERENCES hosts(id),
    url           TEXT NOT NULL,
    source        TEXT NOT NULL,
    rel_type      TEXT,
    rel_sizes     TEXT,
    content_type  TEXT,
    width         INT,
    height        INT,
    file_size     INT,
    s3_key        TEXT,
    scan_state    TEXT DEFAULT 'unscanned',
    error         TEXT
);

CREATE INDEX idx_hosts_parsed ON hosts(id) WHERE parsed = FALSE;
CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned';
CREATE INDEX idx_icons_host_id ON icons(host_id);
```

**Done when:** Tables exist in RDS, schema matches ARCHITECTURE.md.

### Step 1.2: DuckDB CC-Index Query (100K limit)

Write `pipeline/01_cc_index/query.sql` (or a shell script wrapping DuckDB CLI).

The script:
1. Connects DuckDB to RDS via the postgres extension
2. Queries the CC-Index parquet files via httpfs (latest crawl)
3. Filters per ARCHITECTURE.md criteria
4. Deduplicates per hostname (prefer https)
5. Limits to 100,000 rows for dev
6. Inserts directly into the hosts table

Key considerations:
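- Find the latest crawl's columnar index path. The parquet files live under `s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2026-05/subset=warc/*.parquet`; the `cc-index/collections/.../indexes/cdx-*` files are gzipped CDX text, not parquet. Verify the crawl ID before running.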
- DuckDB postgres extension: `INSTALL postgres; LOAD postgres; ATTACH 'dbname=... host=... ...' AS pg (TYPE POSTGRES);`
- The dedup logic: partition by hostname, order by protocol (https first), take first row
- Add `LIMIT 100000` for dev, remove for full run
- Time the query: if httpfs takes >1hr, switch to downloading parquet first

**Validation:**
- `SELECT COUNT(*) FROM hosts;` returns ~100,000
- `SELECT protocol, COUNT(*) FROM hosts GROUP BY protocol;` shows mostly https
- `SELECT * FROM hosts LIMIT 5;` shows sane data (real hostnames, valid WARC paths)
- Spot-check: pick a few hostnames, verify they're real websites

**Stats to emit:** `stats/01_cc_index.json` with total_domains, https_count, http_count, query_time_seconds.

**Done when:** 100K hosts in the database with valid WARC coordinates.

### Step 1.3: Validate WARC Coordinates

Quick sanity check: before writing the full WARC parser, confirm we can actually fetch WARC records:

```bash
# Pick a random row and read its WARC coordinates into shell variables
read -r WARC_FILENAME OFFSET LENGTH < <(psql "$DATABASE_URL" -At -F ' ' -c "SELECT warc_filename, warc_record_offset, warc_record_length FROM hosts ORDER BY random() LIMIT 1;")

# Fetch it with a curl byte-range request; each record is its own gzip member, so decompress before reading
curl -s -r "$OFFSET-$((OFFSET + LENGTH - 1))" "https://data.commoncrawl.org/$WARC_FILENAME" | gunzip -c | head -c 500
```

Should see a WARC record header followed by HTTP response headers and HTML.
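
The same check can be sketched in Go, which is where the Stage 2 parser will need it anyway. This is a minimal sketch rather than the final `warc.go`: `rangeHeader`, `fetchRecord`, `parseRecord`, `gzipBytes`, and `sampleRecord` are hypothetical names, and it assumes each record is an independent gzip member whose payload is a complete HTTP response.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/textproto"
	"strings"
)

// sampleRecord is a hand-written stand-in for a real record (illustration only).
const sampleRecord = "WARC/1.0\r\nWARC-Type: response\r\nWARC-Target-URI: https://example.com/\r\n\r\n" +
	"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: 16\r\n\r\n<title>x</title>"

// rangeHeader builds the inclusive HTTP byte range for one record.
func rangeHeader(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// fetchRecord downloads the still-compressed bytes of a single record.
func fetchRecord(c *http.Client, warcFilename string, offset, length int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, "https://data.commoncrawl.org/"+warcFilename, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rangeHeader(offset, length))
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("expected 206, got %s", resp.Status)
	}
	return io.ReadAll(io.LimitReader(resp.Body, length))
}

// parseRecord gunzips one record, checks the WARC version line, reads the WARC
// headers, and hands the rest to net/http as the archived HTTP response.
func parseRecord(compressed []byte) (textproto.MIMEHeader, *http.Response, error) {
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, nil, err
	}
	raw, err := io.ReadAll(zr)
	if err != nil {
		return nil, nil, err
	}
	br := bufio.NewReader(bytes.NewReader(raw))
	tp := textproto.NewReader(br)
	version, err := tp.ReadLine()
	if err != nil {
		return nil, nil, err
	}
	if !strings.HasPrefix(version, "WARC/") {
		return nil, nil, fmt.Errorf("not a WARC record: %q", version)
	}
	warcHeader, err := tp.ReadMIMEHeader()
	if err != nil {
		return nil, nil, err
	}
	resp, err := http.ReadResponse(br, nil)
	if err != nil {
		return nil, nil, err
	}
	return warcHeader, resp, nil
}

// gzipBytes compresses s so the demo can run without network access.
func gzipBytes(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(s))
	zw.Close()
	return buf.Bytes()
}

func main() {
	hdr, resp, err := parseRecord(gzipBytes(sampleRecord))
	if err != nil {
		panic(err)
	}
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(rangeHeader(100, 50), hdr.Get("WARC-Type"), resp.StatusCode, string(body))
}
```

`main` exercises only the parsing path on an in-memory record; `fetchRecord` would be called with a row's `warc_filename`, `warc_record_offset`, and `warc_record_length`.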

**Done when:** We can manually fetch 3-5 WARC records and see valid HTML content.

---

## Phase 2: WARC Parsing (Stage 2)

### Step 2.1: Go Project Setup

Set up the shared Go module and the WARC parser binary:

```
pipeline/02_warc_parse/
├── main.go       # Entry point, CLI flags, orchestration
├── warc.go       # WARC record fetching (S3 byte-range)
├── parser.go     # HTML parsing (title, link rel=icon, iframe headers)
└── db.go         # Postgres batch read/write
```

Dependencies:
- `github.com/jackc/pgx/v5`: Postgres driver (pool, batch operations)
- `golang.org/x/net/html`: Lenient HTML parser
- Standard library `net/http` for S3 byte-range requests

CLI flags:
- `--db` connection string
- `--batch-size` (default 500)
- `--concurrency` (default 1000)
- `--dry-run` (print parsed results, don't write to DB)
- `--limit` (process at most N rows, for testing)
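
The flag list above maps directly onto Go's standard `flag` package, which accepts both `-name` and `--name`. The following is a sketch only: the `config` struct, the `parseFlags` helper, and the choice to treat `--limit 0` as "no limit" are assumptions for illustration, not something the plan specifies.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// config holds the parsed CLI flags (hypothetical struct for this sketch).
type config struct {
	DB          string
	BatchSize   int
	Concurrency int
	DryRun      bool
	Limit       int
}

// parseFlags parses args into a config using the defaults listed in the plan.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("warc_parse", flag.ContinueOnError)
	fs.StringVar(&c.DB, "db", "", "Postgres connection string")
	fs.IntVar(&c.BatchSize, "batch-size", 500, "rows claimed per batch")
	fs.IntVar(&c.Concurrency, "concurrency", 1000, "number of concurrent workers")
	fs.BoolVar(&c.DryRun, "dry-run", false, "print parsed results, don't write to DB")
	fs.IntVar(&c.Limit, "limit", 0, "process at most N rows (assumed: 0 means no limit)")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if c.DB == "" {
		return c, errors.New("--db is required")
	}
	return c, nil
}

func main() {
	// A real binary would pass os.Args[1:]; fixed arguments keep the sketch self-contained.
	cfg, err := parseFlags([]string{"--db", "postgres://localhost/everytab", "--limit", "100", "--dry-run"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Using a `flag.FlagSet` instead of the package-level flags keeps the parsing testable, and the same helper shape can be reused by the other Go programs in the pipeline.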

**Done when:** Project compiles, connects to DB, can read a batch of hosts rows.

### Step 2.2: WARC Fetch + Parse Logic

Implement:
1. Byte-range fetch from `https://data.commoncrawl.org/{warc_filename}`
2. Decompress the fetched bytes (each Common Crawl record is its own gzip member), then parse the WARC record envelope (find the HTTP response within)
3. Extract HTTP response headers:
   - `X-Frame-Options` → if present and not `ALLOWALL`, iframe_allowed = false
   - `Content-Security-Policy` → check for `frame-ancestors` directive
4. Parse HTML body:
   - Extract `<title>` content (first title tag, truncate at 512 chars)
   - Extract all `<link rel="icon">` and `<link rel="shortcut icon">`:
     - href (resolve relative URLs against `{protocol}://{hostname}/`)
     - type attribute (if present)
     - sizes attribute (if present)
   - Ignore data: URIs, ignore links to other domains' icons for now
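
The header rules and the href handling above can be sketched as two small functions. This is illustrative only: `iframeAllowed`, `resolveIconURL`, and `headers` are hypothetical names, and the treatment of `frame-ancestors` (anything other than a bare `*` source counts as blocking) is an assumption, since the plan only says to check for the directive.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// iframeAllowed applies the X-Frame-Options and frame-ancestors rules.
func iframeAllowed(h http.Header) bool {
	xfo := strings.TrimSpace(h.Get("X-Frame-Options"))
	if xfo != "" && !strings.EqualFold(xfo, "ALLOWALL") {
		return false
	}
	for _, csp := range h.Values("Content-Security-Policy") {
		for _, directive := range strings.Split(csp, ";") {
			fields := strings.Fields(directive)
			if len(fields) == 0 || !strings.EqualFold(fields[0], "frame-ancestors") {
				continue
			}
			// Assumption: only a bare wildcard source keeps the page embeddable.
			wildcard := false
			for _, src := range fields[1:] {
				if src == "*" {
					wildcard = true
				}
			}
			if !wildcard {
				return false
			}
		}
	}
	return true
}

// resolveIconURL resolves href against {protocol}://{hostname}/ and rejects
// data: URIs, non-HTTP schemes, and icons hosted on other domains.
func resolveIconURL(protocol, hostname, href string) (string, bool) {
	href = strings.TrimSpace(href)
	if href == "" || strings.HasPrefix(strings.ToLower(href), "data:") {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	base := &url.URL{Scheme: protocol, Host: hostname, Path: "/"}
	abs := base.ResolveReference(ref)
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false
	}
	if !strings.EqualFold(abs.Hostname(), hostname) {
		return "", false
	}
	abs.Fragment = ""
	return abs.String(), true
}

// headers builds an http.Header from key/value pairs (demo helper).
func headers(kv ...string) http.Header {
	h := http.Header{}
	for i := 0; i+1 < len(kv); i += 2 {
		h.Add(kv[i], kv[i+1])
	}
	return h
}

func main() {
	fmt.Println(iframeAllowed(headers("X-Frame-Options", "DENY")))
	fmt.Println(resolveIconURL("https", "example.com", "../icons/fav.png"))
}
```

`url.ResolveReference` follows RFC 3986, so dot segments that would climb above the root are dropped rather than producing an invalid path.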

**Dry-run test:** Run with `--limit 100 --dry-run` and inspect output. Check:
- Titles look reasonable (not empty, not garbage)
- Icon URLs are well-formed (absolute, correct protocol)
- iframe_allowed is set correctly (spot-check against real sites)

**Done when:** Can parse 100 WARC records correctly with `--dry-run` showing reasonable output.

### Step 2.3: Batch DB Writes + Full 100K Run

Implement the database write path:
1. For each parsed host: UPDATE hosts SET html_title, iframe_allowed, parsed = TRUE
2. For each host: INSERT INTO icons (host_id, url, source='favicon_ico') for `/favicon.ico`
3. For each discovered link rel=icon: INSERT INTO icons (host_id, url, source='link_rel', rel_type, rel_sizes)
4. Use batch/bulk operations (pgx CopyFrom or batch INSERT)

Run against the full 100K hosts:
- Monitor throughput (hosts/sec)
- Watch for errors (log to stderr)

**Validation:**
- `SELECT COUNT(*) FROM hosts WHERE parsed = TRUE;` should approach 100,000
- `SELECT COUNT(*) FROM icons;` should be > 100,000 (at minimum one /favicon.ico per host)
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL;` (expect 90%+)
- `SELECT source, COUNT(*) FROM icons GROUP BY source;` (see the split)
- `SELECT COUNT(*) FROM hosts WHERE iframe_allowed = FALSE;` (expect 30-50%)
- Spot-check: pick some hosts, verify title matches the actual site

**Stats:** `stats/02_warc_parse.json`

**Done when:** All 100K hosts parsed, icons table populated, stats look reasonable.

---

## Phase 3: Icon Download (Stage 3)

### Step 3.1: Icon Downloader Go Program

```
pipeline/03_icon_download/
├── main.go         # Entry point, CLI flags, worker pool
├── downloader.go   # HTTP fetch with timeouts, size limits
├── decoder.go      # Image validation + dimension extraction
├── s3.go           # Upload to everytab-icons bucket
└── db.go           # Claim work, update results
```

CLI flags:
- `--db` connection string
- `--s3-bucket` (default `everytab-icons`)
- `--concurrency` (default 1000, tunable)
- `--batch-size` (default 500)
- `--timeout` (default 10s)
- `--max-size` (default 512KB)
- `--dry-run` (fetch and validate but don't upload to S3 or update DB)
- `--limit` (process at most N icons)
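
One way the `--concurrency` flag could bound the worker pool mentioned for `main.go` is a fixed set of goroutines draining a channel. This is a sketch under assumptions: `runPool` and `errDemo` are hypothetical names, and the real downloader would pass icon rows claimed from Postgres rather than plain strings.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// errDemo stands in for a download failure in the example below.
var errDemo = errors.New("demo failure")

// runPool processes jobs with at most `concurrency` goroutines running work
// at once and returns how many jobs succeeded and how many failed.
func runPool(concurrency int, jobs []string, work func(string) error) (done, failed int64) {
	ch := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range ch {
				if err := work(j); err != nil {
					atomic.AddInt64(&failed, 1)
				} else {
					atomic.AddInt64(&done, 1)
				}
			}
		}()
	}
	for _, j := range jobs {
		ch <- j
	}
	close(ch)
	wg.Wait()
	return done, failed
}

func main() {
	urls := []string{"https://a.example/favicon.ico", "bad", "https://b.example/favicon.ico"}
	done, failed := runPool(2, urls, func(u string) error {
		if u == "bad" {
			return errDemo
		}
		return nil
	})
	fmt.Println("done:", done, "failed:", failed)
}
```

Because the channel is unbuffered, the producer blocks until a worker is free, so memory use stays proportional to the concurrency setting rather than to the batch size.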

Dependencies:
- `github.com/jackc/pgx/v5`: Postgres
- `github.com/aws/aws-sdk-go-v2`: S3 uploads
- Standard library `image` and its sub-packages (PNG, GIF, JPEG) for decoding dimensions, plus `golang.org/x/image/bmp` and `golang.org/x/image/webp` for the formats the standard library lacks
- An ICO decoder: find a maintained library, or write a simple one that reads the ICO header and its directory entries

### Step 3.2: Work Claiming + Download Logic

Implement:
1. Claim batch with randomized order (md5 shuffle, FOR UPDATE SKIP LOCKED)
2. For each icon URL:
   - HTTP GET with timeouts (5s dial, 10s total)
   - Read up to max-size bytes, abort if exceeded
   - Validate magic bytes (PNG: `\x89PNG`, GIF: `GIF8`, ICO: `\x00\x00\x01\x00`, etc.)
   - Determine actual content type from magic bytes (don't trust HTTP Content-Type)
   - Decode dimensions:
     - PNG/GIF/JPEG/WebP/BMP: read image header (Go `image.DecodeConfig`)
     - ICO: parse directory entries, find largest at standard size ≤64x64
     - SVG: set width=NULL, height=NULL
   - Compute SHA-256 of full content
   - Check if S3 key exists (HEAD request); if yes, skip upload (dedup)
   - Upload to S3 if new
3. Update icons row with results (or error)
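
The magic-byte, ICO directory, and hashing steps above can be sketched with the standard library alone. Everything named here is hypothetical (`sniffType`, `icoBestEntry`, `contentKey`, `standardSizes`); the sketch assumes "standard size" means a square 16, 32, 48, or 64 pixel entry, that the S3 key is the hex SHA-256 digest, and it uses a simple `<svg` substring heuristic that the plan does not prescribe.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// sniffType returns a content type based on magic bytes, or "" if unknown.
func sniffType(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")):
		return "image/png"
	case bytes.HasPrefix(b, []byte("GIF8")):
		return "image/gif"
	case bytes.HasPrefix(b, []byte{0x00, 0x00, 0x01, 0x00}):
		return "image/x-icon"
	case bytes.HasPrefix(b, []byte{0xFF, 0xD8, 0xFF}):
		return "image/jpeg"
	case len(b) >= 12 && string(b[0:4]) == "RIFF" && string(b[8:12]) == "WEBP":
		return "image/webp"
	case bytes.HasPrefix(b, []byte("BM")):
		return "image/bmp"
	}
	head := b
	if len(head) > 512 {
		head = head[:512]
	}
	// Heuristic (assumption): SVG has no magic number, so look for an <svg tag.
	if bytes.Contains(bytes.ToLower(head), []byte("<svg")) {
		return "image/svg+xml"
	}
	return ""
}

// standardSizes mirrors the sizes preferred by the best-icon query (assumption).
var standardSizes = map[int]bool{16: true, 32: true, 48: true, 64: true}

// icoBestEntry walks the ICO directory (6-byte header, 16 bytes per entry)
// and returns the largest square entry at a standard size of 64 or less.
func icoBestEntry(b []byte) (w, h int, ok bool) {
	if len(b) < 6 || !bytes.HasPrefix(b, []byte{0x00, 0x00, 0x01, 0x00}) {
		return 0, 0, false
	}
	count := int(binary.LittleEndian.Uint16(b[4:6]))
	for i := 0; i < count; i++ {
		off := 6 + 16*i
		if off+16 > len(b) {
			break
		}
		ew, eh := int(b[off]), int(b[off+1])
		if ew == 0 {
			ew = 256 // a stored 0 means 256 pixels
		}
		if eh == 0 {
			eh = 256
		}
		if ew == eh && standardSizes[ew] && ew > w {
			w, h, ok = ew, eh, true
		}
	}
	return w, h, ok
}

// contentKey returns the hex SHA-256 digest used here as the dedup key.
func contentKey(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	ico := []byte{0, 0, 1, 0, 2, 0}
	ico = append(ico, 16, 16, 0, 0, 1, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0)
	ico = append(ico, 32, 32, 0, 0, 1, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0)
	w, h, ok := icoBestEntry(ico)
	fmt.Println(sniffType(ico), w, h, ok, contentKey(ico)[:8])
}
```

Only the directory is read here; decoding the embedded BMP or PNG payload is left to the conversion stage.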

**Dry-run test:** `--limit 200 --dry-run` prints what it would do for 200 icons. Check URLs, detected types, dimensions.

**Done when:** Can download, validate, and upload icons for a small batch.

### Step 3.3: Full 100K Icon Run

Run against all icons in the database (likely 150K-300K icon rows for 100K hosts).

Monitor:
- icons/sec throughput
- Error breakdown (DNS failures, timeouts, HTTP errors, invalid images)
- S3 dedup hit rate
- Memory usage (adjust concurrency if needed)

**Validation:**
- `SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;` (expect mostly completed, some failed)
- `SELECT error, COUNT(*) FROM icons WHERE scan_state = 'failed' GROUP BY error ORDER BY count DESC LIMIT 20;` (to understand failure modes)
- `aws s3 ls s3://everytab-icons/ | wc -l` (confirm icons are in S3)
- Spot-check: download a few icons from S3, open them, verify they're valid images

**Stats:** `stats/03_icon_download.json`

**Done when:** Icon download complete for 100K dev set, error rate understood, S3 populated.

---

## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5)

### Step 4.1: Best Icon Selection SQL

Write `pipeline/04_best_icon/select.sql`:

```sql
UPDATE hosts h SET best_icon_s3_key = sub.s3_key
FROM (
    SELECT DISTINCT ON (i.host_id) i.host_id, i.s3_key
    FROM icons i
    WHERE i.scan_state = 'completed'
    ORDER BY i.host_id,
        CASE
            WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
            WHEN i.width = i.height AND i.width <= 64 THEN 1
            WHEN i.width IS NOT NULL AND i.width <= 64 AND i.height <= 64 THEN 2
            ELSE 3
        END,
        COALESCE(i.width, 0) DESC,
        CASE
            WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
            WHEN i.content_type = 'image/webp' THEN 1
            WHEN i.content_type = 'image/svg+xml' THEN 2
            ELSE 3
        END,
        i.file_size ASC
) sub
WHERE h.id = sub.host_id;
```

**Validation:**
- `SELECT COUNT(*) FROM hosts WHERE best_icon_s3_key IS NOT NULL;` (expect 60-80% of hosts)
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL AND best_icon_s3_key IS NULL;` (hosts with a title but no icon; these will still be in bundles)
- Spot-check: for a few hosts, verify the selected icon is reasonable (correct size, valid)

**Stats:** `stats/04_best_icon.json`

**Done when:** best_icon_s3_key populated for hosts that have valid icons.

### Step 4.2: Bundle Generator Go Program

```
pipeline/05_bundle_gen/
├── main.go       # Entry point, CLI flags
├── db.go         # Query hosts + icon keys
├── convert.go    # Icon format conversion → PNG
├── bundle.go     # Chunk + serialize JSON
└── s3.go         # Upload bundles to everytab-site
```

CLI flags:
- `--db` connection string
- `--icons-bucket` (default `everytab-icons`)
- `--site-bucket` (default `everytab-site`)
- `--entries-per-bundle` (tunable, start at 120)
- `--dry-run` (generate bundles to local disk, don't upload)
- `--limit` (only process N hosts, for testing)

### Step 4.3: Icon Conversion Logic

Implement format conversion to PNG:
1. Download icon from S3 by key
2. Detect format from magic bytes
3. Decode:
   - PNG: decode directly
   - ICO: parse container, extract image at recorded width/height, decode BMP or PNG within
   - GIF/JPEG/BMP/WebP: decode to RGBA
   - SVG: rasterize to 32x32 (use a Go SVG library, or shell out to `rsvg-convert` if simpler)
4. Re-encode as PNG (optimized, don't upscale)
5. Base64-encode
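
For the formats the standard library decodes on its own (PNG, GIF, JPEG), steps 3 to 5 above could look roughly like this. It is a partial sketch: `toPNGBase64`, `sampleGIF`, and `pngSize` are hypothetical helpers, ICO, BMP, WebP, and SVG are deliberately left out, and no resizing is performed, which trivially satisfies the "don't upscale" rule.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"image"
	"image/color"
	"image/gif"
	_ "image/jpeg" // registers the JPEG decoder with image.Decode
	"image/png"
)

// toPNGBase64 decodes raw image bytes, re-encodes them as PNG, and returns
// the base64 text together with the pixel dimensions.
func toPNGBase64(raw []byte) (string, int, int, error) {
	img, _, err := image.Decode(bytes.NewReader(raw))
	if err != nil {
		return "", 0, 0, err
	}
	var buf bytes.Buffer
	enc := png.Encoder{CompressionLevel: png.BestCompression}
	if err := enc.Encode(&buf, img); err != nil {
		return "", 0, 0, err
	}
	b := img.Bounds()
	return base64.StdEncoding.EncodeToString(buf.Bytes()), b.Dx(), b.Dy(), nil
}

// sampleGIF builds a tiny two-colour GIF so the demo needs no input files.
func sampleGIF(w, h int) []byte {
	img := image.NewPaletted(image.Rect(0, 0, w, h), color.Palette{color.Black, color.White})
	var buf bytes.Buffer
	if err := gif.Encode(&buf, img, nil); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// pngSize decodes base64 PNG text and reports its dimensions.
func pngSize(b64 string) (int, int, error) {
	raw, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return 0, 0, err
	}
	cfg, err := png.DecodeConfig(bytes.NewReader(raw))
	if err != nil {
		return 0, 0, err
	}
	return cfg.Width, cfg.Height, nil
}

func main() {
	b64, w, h, err := toPNGBase64(sampleGIF(16, 16))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%dx%d data:image/png;base64,%s...\n", w, h, b64[:16])
}
```

The returned width and height are what a bundle entry's `icon_w` and `icon_h` fields would carry.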

**Test:** Convert 50 icons of mixed formats manually, verify output PNGs look correct.

### Step 4.4: Bundle Assembly + Upload

Implement:
1. Query all hosts WHERE html_title IS NOT NULL, randomize (ORDER BY random())
2. For each host: fetch + convert its icon (or set empty string if no icon)
3. Assemble entries into chunks of `ENTRIES_PER_BUNDLE`
4. Serialize each chunk as JSON (`tabs/{n}.json`)
5. Upload to S3 `everytab-site/tabs/`
6. Record total bundle count
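
Steps 3 and 4 above amount to slicing the entry list and marshalling each slice. The sketch below assumes each bundle is a bare JSON array of entries and that the field names are `host`, `title`, `icon`, `icon_w`, `icon_h`, and `iframe_ok`; `Entry`, `chunk`, `bundleName`, and `buildBundles` are hypothetical names, and the real layout should follow ARCHITECTURE.md.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entry is one tab in a bundle (field layout is an assumption for this sketch).
type Entry struct {
	Host     string `json:"host"`
	Title    string `json:"title"`
	Icon     string `json:"icon"` // base64 PNG, or "" when the host has no icon
	IconW    int    `json:"icon_w"`
	IconH    int    `json:"icon_h"`
	IframeOK bool   `json:"iframe_ok"`
}

// chunk splits entries into consecutive slices of at most size elements.
func chunk(entries []Entry, size int) [][]Entry {
	if size <= 0 {
		return nil
	}
	var out [][]Entry
	for start := 0; start < len(entries); start += size {
		end := start + size
		if end > len(entries) {
			end = len(entries)
		}
		out = append(out, entries[start:end])
	}
	return out
}

// bundleName returns the object key for bundle n.
func bundleName(n int) string {
	return fmt.Sprintf("tabs/%d.json", n)
}

// buildBundles serializes every chunk and returns key → JSON bytes.
func buildBundles(entries []Entry, entriesPerBundle int) (map[string][]byte, error) {
	bundles := make(map[string][]byte)
	for n, c := range chunk(entries, entriesPerBundle) {
		data, err := json.Marshal(c)
		if err != nil {
			return nil, err
		}
		bundles[bundleName(n)] = data
	}
	return bundles, nil
}

func main() {
	entries := []Entry{
		{Host: "a.example", Title: "A", IframeOK: true},
		{Host: "b.example", Title: "B"},
		{Host: "c.example", Title: "C"},
	}
	bundles, err := buildBundles(entries, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(bundles), string(bundles["tabs/1.json"]))
}
```

The number of keys in the returned map is the total bundle count that the frontend build later needs.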

**Dry-run:** Generate bundles to local disk, inspect a few:
- Valid JSON
- Icons render in browser (paste a data:image/png;base64,... URI)
- Entries have host, title, icon, icon_w, icon_h, iframe_ok

**Validation:**
- Bundle files exist in S3
- `aws s3 ls s3://everytab-site/tabs/ | wc -l` matches expected count
- Random bundle can be fetched and parsed as JSON
- Total hosts across all bundles = count of hosts with titles

**Stats:** `stats/05_bundle_gen.json`

**Done when:** All bundles uploaded to S3, JSON is valid, icons render.

---

## Phase 5: Frontend (Stage 6)

This phase can begin in parallel with Phase 3-4 using mock bundle data.

### Step 5.1: Mock Data for Frontend Dev

Generate 2-3 small mock bundle files (`tabs/0.json`, `tabs/1.json`, `tabs/2.json`) with ~20 entries each. Use real favicons (Google, GitHub, Wikipedia, etc.) manually base64-encoded. This lets us develop the frontend without waiting for the pipeline.

Serve locally with any static file server (`python -m http.server`).

**Done when:** Mock bundles exist and can be served locally.

### Step 5.2: Basic Tab Rendering

Build `frontend/index.html` and `frontend/site.js`:

1. HTML: minimal shell with a container div, inline CSS for tab styling
2. JS: fetch a bundle, render tabs as rows filling the viewport
3. Tab appearance: mimic Firefox tab shape (rounded top corners, slight border)
4. Each tab shows favicon (16x16 or 32x32 img from data URI) + truncated title
5. No-icon tabs show title only

Focus: get the visual density right. How many tabs fit across? How many rows fill the viewport? This determines `ENTRIES_PER_BUNDLE`.

**Done when:** Page renders tabs from a mock bundle. Visually looks like a page full of browser tabs.

### Step 5.3: Marquee Animation

Add horizontal marquee to each row:
- CSS `@keyframes` animation, translateX
- Each row at slightly different speed and direction (some left, some right)
- Smooth, subtle movement: not distracting, just enough to feel alive
- Rows need extra tabs beyond viewport width to avoid gaps during scroll

**Done when:** Rows scroll smoothly, no visual glitches at edges.

### Step 5.4: Interaction (Click, Iframe, Close)

Implement tab click behavior:
1. If `iframe_ok`: show an overlay with iframe loading the site (`{protocol}://{hostname}`)
2. If `!iframe_ok`: open in new tab (`target="_blank"`, add `rel="noopener"`)
3. Visual indicator on tabs that will open externally (small icon/badge)
4. Close overlay: X button + click-outside + Escape key

**Done when:** Clicking tabs works correctly for both iframe and external cases.

### Step 5.5: Infinite Scroll + Random Bundle Loading

Implement:
1. Seeded PRNG using `Date.now()` as the seed: generates a deterministic sequence of bundle indices for that seed
2. On page load: fetch first bundle, render
3. Scroll detection: when user approaches bottom, fetch next random bundle
4. Track loaded bundle IDs in a Set (no duplicates)
5. Append new rows below existing ones
6. Handle edge case: all bundles loaded (unlikely with 50K+ bundles but handle gracefully)

`TOTAL_BUNDLES` is a constant baked into the JS at build time.

**Done when:** Infinite scroll works, new bundles load seamlessly, no duplicate bundles.

### Step 5.6: Frontend Build Script

Write `pipeline/06_frontend/build.sh`:
1. Read total bundle count (from pipeline output or S3)
2. Inject `const TOTAL_BUNDLES = {M};` into site.js
3. Copy index.html + site.js to S3 `everytab-site/`
4. Invalidate CloudFront (if distribution exists)

**Done when:** Build script produces deployable frontend with correct bundle count.

---

## Phase 6: Integration & End-to-End Test (100K)

### Step 6.1: Run Full Pipeline (100K)

Execute all stages in sequence on EC2:
1. Verify hosts table has 100K entries (from Phase 1)
2. Run WARC parser (Phase 2); should complete in minutes
3. Run icon downloader (Phase 3); should complete in 10-30 minutes at 100K scale
4. Run best icon selection (Phase 4.1)
5. Run bundle generator (Phase 4.2-4.4)
6. Run frontend build (Phase 5.6)

**Validation:** Visit the CloudFront URL. The site should work:
- Tabs render with real favicons and titles
- Clicking works (iframe + external)
- Scrolling loads more tabs
- No JS console errors

### Step 6.2: Tune Parameters

Based on the 100K run:
- **ENTRIES_PER_BUNDLE:** Look at the live site. Does one bundle fill the screen? Too many tabs? Too few? Adjust.
- **Concurrency:** Was the icon download memory-stable? CPU-bound or network-bound? Adjust goroutine pool size.
- **Timeouts:** What was the error distribution? Are timeouts too aggressive? Too lenient?
- **Icon selection:** Do the selected icons look good? Any weird sizes or broken images?

Update CLI flag defaults based on findings.

### Step 6.3: Collect & Review Stats

Merge all `stats/*.json` into a single pipeline report. Review:
- Loss at each stage (domains → parsed → icons downloaded → icons selected → bundled)
- Time per stage
- Error patterns (are certain TLDs failing more? certain icon formats?)
- Storage usage (S3 icons bucket, S3 site bucket)

Identify any pipeline bugs or data quality issues. Fix before scaling up.

**Done when:** End-to-end works at 100K, parameters tuned, stats reviewed, bugs fixed.

---

## Phase 7: Full-Scale Run (30M)

### Step 7.1: Remove Limits, Re-run CC-Index Query

Update the DuckDB query to remove `LIMIT 100000`. Re-run.

Considerations:
- If httpfs takes >1hr, switch to downloading the parquet files first
- May need to increase RDS storage (30M rows with WARC paths ≈ 5-10GB)
- Monitor DuckDB memory usage

**Validation:** `SELECT COUNT(*) FROM hosts;` shows ~30M rows.

### Step 7.2: Run WARC Parser at Scale

Run with full concurrency against 30M hosts. Expected time: 2-6 hours.

Monitor:
- Throughput (hosts/sec)
- Error rate stability (should plateau, not climb)
- Postgres connection pool health
- Memory usage

### Step 7.3: Run Icon Downloader at Scale

This is the long pole: expected 12-48 hours.

Monitor continuously:
- icons/sec rate
- DNS cache hit rate (check Unbound stats: `unbound-control stats`)
- S3 upload rate
- Error rate by type
- Completion percentage

If too slow (projected >48hrs):
- Consider increasing concurrency (if memory allows)
- Consider spinning up fleet (add more EC2 instances running the same binary)
- Check if DNS is the bottleneck (Unbound stats)
- Check if S3 uploads are the bottleneck (batch or reduce HEAD checks)

### Step 7.4: Best Icon Selection + Bundle Generation

Run at full scale. Expected: 1-2 hours total.

Monitor bundle sizes: verify they're in the expected range with `ENTRIES_PER_BUNDLE` from tuning.

### Step 7.5: Rebuild Frontend + Deploy

Run frontend build with the real bundle count. Invalidate CloudFront.

**Validation:** Visit the live site. Browse around. Check:
- Tab variety (seeing diverse sites, not just one TLD)
- Icon quality (no broken images, reasonable sizes)
- Performance (bundles load quickly, no jank)
- Stats page / stats.json looks correct

**Done when:** Full-scale site is live and working.

---

## Phase 8: Backup & Teardown

### Step 8.1: Backup RDS to Homelab

```bash
# On EC2 (fast connection to RDS):
pg_dump -Fc "$DATABASE_URL" > everytab_dump.pgfc

# Transfer to homelab (from EC2 or direct):
scp everytab_dump.pgfc homelab:/backups/everytab/

# On homelab, verify restore (pg_restore needs the target database to exist):
createdb everytab_local
pg_restore -d everytab_local everytab_dump.pgfc
psql everytab_local -c "SELECT COUNT(*) FROM hosts;" -c "SELECT COUNT(*) FROM icons;"
```

### Step 8.2: Backup Icons S3 to Homelab

```bash
# From homelab (or EC2 as intermediary):
aws s3 sync s3://everytab-icons/ /backups/everytab/icons/

# Verify file count matches:
ls /backups/everytab/icons/ | wc -l
# Compare with: aws s3 ls s3://everytab-icons/ | wc -l
```

### Step 8.3: Verify & Teardown

After confirming backups:

```bash
# Verify the live site still works (it only depends on everytab-site + CloudFront)
curl -s https://your-cloudfront-domain.net/ | head

# Teardown scanning infrastructure:
aws rds delete-db-instance --db-instance-identifier everytab --skip-final-snapshot
aws s3 rb s3://everytab-icons --force
aws ec2 terminate-instances --instance-ids i-xxxxx
```

**Done when:** Only `everytab-site` S3 bucket + CloudFront remain running. Monthly cost: ~$2-4.

---

## Development Notes

### What Can Be Parallelized

- **Frontend dev (Phase 5.1-5.5)** can happen at any time using mock data
- **AWS infra setup (Phase 0.2)** can happen while writing code locally
- **Icon downloader (Phase 3)** and **bundle generator (Phase 4)** are independent codebases, can be written in parallel

### Testing Strategy

- **Dry-run flags** on all Go programs: print what would happen without mutating DB/S3
- **--limit flags** on all Go programs: process a small subset quickly
- **Spot-checks:** after each stage, manually verify 5-10 random entries
- **Stats files:** compare counts between stages to catch data loss
- **100K dev set:** full pipeline at small scale before committing to a 24hr+ full run

### Common Pitfalls to Watch For

- **DuckDB CC-Index path:** The exact S3 path to parquet files changes per crawl. Check Common Crawl's website for the latest crawl ID and index location.
- **WARC record format:** Each record is stored as its own gzip member, so decompress the byte range first. The decompressed record has a specific envelope format (WARC/1.0 header, blank line, HTTP response). Don't assume the HTTP response starts at byte 0.
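- **Relative icon URLs:** `/favicon.ico` is relative to root, but `favicon.ico` (no leading slash) is relative to the page path. Since we only have root pages (`/`), both resolve the same. `../icons/fav.png` resolves to `/icons/fav.png` under RFC 3986, because dot segments that would climb above the root are dropped; Go's `url.ResolveReference` handles this. Skip any href that fails to parse.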
- **ICO files are complex:** The ICO container format can embed BMP (with a modified header) or PNG. Many "ICO" files are actually just PNGs renamed to .ico. Check magic bytes, not file extension.
- **SVG rasterization:** Go doesn't have great native SVG support. Consider shelling out to `rsvg-convert` (the CLI that ships with librsvg), or use a Go rasterizer such as `github.com/srwiley/oksvg`. This can be a follow-up if SVG icons are rare.
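- **Postgres connection limits:** RDS derives the default `max_connections` from instance memory (`LEAST({DBInstanceClassMemory/9531392}, 5000)`), so a 4 GiB db.t3.medium should allow a few hundred connections; run `SHOW max_connections;` to confirm. That is still far fewer than 1000 goroutines, so we need connection pooling (pgx pool handles this). Set pool max to ~40 connections.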
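- **S3 dedup races:** S3 has offered strong read-after-write consistency since December 2020, so a HEAD request issued after a completed upload will find the object. The remaining race is two workers that both run HEAD before either has uploaded; both then upload the same bytes, which is harmless because the key is the content hash and the write is idempotent.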
- **CloudFront caching:** After deploying new bundles, invalidate `/*` or set short TTL during development. For production, use long TTLs (bundles are immutable between crawls).