# EveryTab Implementation Plan
|
|
|
|
This plan builds the system described in ARCHITECTURE.md in incremental steps. We start with 100K hosts to validate the pipeline end-to-end, then scale to the full ~30M.
|
|
|
|
Each step has a clear deliverable and validation criteria. Steps within a phase are sequential; some phases can overlap (noted where applicable).
|
|
|
|
---
|
|
|
|
## Phase 0: Project Setup & AWS Infrastructure
|
|
|
|
### Step 0.1: Repository Structure
|
|
|
|
Create the project layout:
|
|
|
|
```
everytab/
├── design.md
├── ARCHITECTURE.md
├── PLAN.md
├── infra/                    # AWS CLI scripts for setup/teardown
│   ├── setup.sh              # Create RDS, S3 buckets, security groups
│   ├── teardown.sh           # Delete non-permanent resources
│   └── ec2-userdata.sh       # EC2 bootstrap (install Go, DuckDB, Unbound)
├── pipeline/
│   ├── 01_cc_index/          # DuckDB query scripts
│   ├── 02_warc_parse/        # Go program
│   ├── 03_icon_download/     # Go program
│   ├── 04_best_icon/         # SQL script
│   ├── 05_bundle_gen/        # Go program
│   └── 06_frontend/          # Build script, templates
├── frontend/
│   ├── index.html
│   └── site.js
├── stats/                    # Stats output from each stage (gitignored)
└── go.mod                    # Shared Go module for pipeline programs
```
|
|
|
|
**Done when:** Repo structure exists, `go.mod` initialized, `.gitignore` covers stats/ and any local config.
|
|
|
|
### Step 0.2: AWS Infrastructure (Manual CLI)
|
|
|
|
Create resources using AWS CLI commands in `infra/setup.sh`:
|
|
|
|
1. **S3 buckets:**
   - `everytab-icons` (private, no public access)
   - `everytab-site` (private, accessed via CloudFront OAC)

2. **RDS Postgres:**
   - `db.t3.medium`, 20GB storage (expandable), Postgres 16
   - In a VPC, security group allows inbound 5432 from EC2 security group
   - No public access (EC2 connects within VPC)
   - No multi-AZ (dev, not production)
   - Set a strong password, store in a local `.env` (gitignored)

3. **EC2 instance:**
   - `c5.xlarge` (4 vCPU, 8GB RAM): enough for Go concurrency + Unbound cache
   - Amazon Linux 2023 or Ubuntu 24.04
   - Security group: allow SSH (from your IP), allow outbound all
   - Same VPC/subnet as RDS
   - Key pair for SSH access

4. **CloudFront distribution:**
   - Origin: `everytab-site` S3 bucket (OAC)
   - Default cache behavior: cache everything, Brotli+Gzip compression
   - Can set up now or defer to Phase 2

5. **IAM role for EC2:**
   - S3 read/write to both buckets
   - Attach as instance profile
|
|
|
|
**Validation:** SSH into EC2, confirm `psql` can connect to RDS, confirm `aws s3 ls` shows both buckets.
|
|
|
|
**Done when:** All resources exist, EC2 can reach RDS and S3.
|
|
|
|
### Step 0.3: EC2 Environment Setup
|
|
|
|
Bootstrap script (`infra/ec2-userdata.sh` or run manually):
|
|
|
|
1. Install Go (latest stable, 1.22+)
2. Install DuckDB CLI
3. Install Unbound, configure as recursive resolver:
   - `/etc/unbound/unbound.conf`: recursive mode, no forwarding, listen on 127.0.0.1
   - High cache: `msg-cache-size: 512m`, `rrset-cache-size: 1g`
   - `cache-min-ttl: 3600`
   - `prefetch: yes`
   - `num-threads: 4`
4. Set `/etc/resolv.conf` → `nameserver 127.0.0.1`
5. Install `psql` client, `pg_dump`
6. Confirm DuckDB httpfs extension works: `INSTALL httpfs; LOAD httpfs;`
|
|
|
|
**Validation:**
|
|
- `go version` works
|
|
- `duckdb -c "INSTALL httpfs; LOAD httpfs; SELECT 1;"` works
|
|
- `dig example.com @127.0.0.1` resolves (Unbound working)
|
|
- `psql $DATABASE_URL -c "SELECT 1;"` connects to RDS
|
|
|
|
**Done when:** EC2 is a working development environment for all pipeline stages.
|
|
|
|
---
|
|
|
|
## Phase 1: CC-Index Query (Stage 1)
|
|
|
|
### Step 1.1: Database Schema
|
|
|
|
Create the Postgres tables. Run via `psql`:
|
|
|
|
```sql
CREATE TABLE hosts (
    id SERIAL PRIMARY KEY,
    hostname TEXT NOT NULL UNIQUE,
    protocol TEXT NOT NULL,
    crawl_id TEXT NOT NULL,
    warc_filename TEXT NOT NULL,
    warc_record_offset BIGINT NOT NULL,
    warc_record_length INT NOT NULL,
    html_title TEXT,
    iframe_allowed BOOLEAN,
    best_icon_s3_key TEXT,
    parsed BOOLEAN DEFAULT FALSE
);

CREATE TABLE icons (
    id SERIAL PRIMARY KEY,
    host_id INT NOT NULL REFERENCES hosts(id),
    url TEXT NOT NULL,
    source TEXT NOT NULL,
    rel_type TEXT,
    rel_sizes TEXT,
    content_type TEXT,
    width INT,
    height INT,
    file_size INT,
    s3_key TEXT,
    scan_state TEXT DEFAULT 'unscanned',
    error TEXT
);

CREATE INDEX idx_hosts_parsed ON hosts(id) WHERE parsed = FALSE;
CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned';
CREATE INDEX idx_icons_host_id ON icons(host_id);
```
|
|
|
|
**Done when:** Tables exist in RDS, schema matches ARCHITECTURE.md.
|
|
|
|
### Step 1.2: DuckDB CC-Index Query (100K limit)
|
|
|
|
Write `pipeline/01_cc_index/query.sql` (or a shell script wrapping DuckDB CLI).
|
|
|
|
The script:
|
|
1. Connects DuckDB to RDS via the postgres extension
|
|
2. Queries the CC-Index parquet files via httpfs (latest crawl)
|
|
3. Filters per ARCHITECTURE.md criteria
|
|
4. Deduplicates per hostname (prefer https)
|
|
5. Limits to 100,000 rows for dev
|
|
6. Inserts directly into the hosts table
|
|
|
|
Key considerations:
|
|
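- Find the latest crawl's columnar index path. The Parquet index lives under `s3://commoncrawl/cc-index/table/cc-main/warc/crawl=<CRAWL-ID>/subset=warc/*.parquet`; the `cc-index/collections/<CRAWL-ID>/indexes/` prefix holds the gzipped CDX files, not Parquet. Verify the crawl ID against Common Crawl's published crawl list.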
|
|
- DuckDB postgres extension: `INSTALL postgres; LOAD postgres; ATTACH 'dbname=... host=... ...' AS pg (TYPE POSTGRES);`
|
|
- The dedup logic: partition by hostname, order by protocol (https first), take first row
|
|
- Add `LIMIT 100000` for dev, remove for full run
|
|
- Time the query — if httpfs takes >1hr, switch to downloading parquet first
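
The dedup rule above (one row per hostname, https preferred) is implemented in SQL, but the same rule written in Go is handy as a reference when unit-testing or spot-checking the query output. This is a minimal sketch; the `HostRow` type and its field names are hypothetical and not part of the schema.

```go
package main

import "sort"

// HostRow is a hypothetical, trimmed-down view of one CC-Index row.
type HostRow struct {
	Hostname string
	Protocol string // "https" or "http"
	WARCFile string
}

// protocolRank orders https ahead of everything else.
func protocolRank(p string) int {
	if p == "https" {
		return 0
	}
	return 1
}

// dedupHosts keeps one row per hostname, preferring https. It mirrors the
// "partition by hostname, order by protocol, take first row" rule.
func dedupHosts(rows []HostRow) []HostRow {
	best := make(map[string]HostRow, len(rows))
	for _, r := range rows {
		cur, seen := best[r.Hostname]
		if !seen || protocolRank(r.Protocol) < protocolRank(cur.Protocol) {
			best[r.Hostname] = r
		}
	}
	out := make([]HostRow, 0, len(best))
	for _, r := range best {
		out = append(out, r)
	}
	// Sort for deterministic output; Go map iteration order is randomized.
	sort.Slice(out, func(i, j int) bool { return out[i].Hostname < out[j].Hostname })
	return out
}
```

When two rows for a hostname share the same protocol, this sketch keeps whichever it saw first; the SQL version should add an explicit tie-breaker if the choice matters.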
|
|
|
|
**Validation:**
|
|
- `SELECT COUNT(*) FROM hosts;` returns ~100,000
|
|
- `SELECT protocol, COUNT(*) FROM hosts GROUP BY protocol;` shows mostly https
|
|
- `SELECT * FROM hosts LIMIT 5;` shows sane data (real hostnames, valid WARC paths)
|
|
- Spot-check: pick a few hostnames, verify they're real websites
|
|
|
|
**Stats to emit:** `stats/01_cc_index.json` with total_domains, https_count, http_count, query_time_seconds.
|
|
|
|
**Done when:** 100K hosts in the database with valid WARC coordinates.
|
|
|
|
### Step 1.3: Validate WARC Coordinates
|
|
|
|
Quick sanity check — before writing the full WARC parser, confirm we can actually fetch WARC records:
|
|
|
|
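```bash
# Pick a random row (-At gives unaligned, tuples-only output that `read` can split)
read -r WARC_FILENAME OFFSET LENGTH < <(psql "$DATABASE_URL" -At -F ' ' -c \
  "SELECT warc_filename, warc_record_offset, warc_record_length FROM hosts ORDER BY random() LIMIT 1;")

# Fetch it with a curl byte-range request.
# Each WARC record is its own gzip member, so decompress before inspecting it.
curl -s -r "$OFFSET-$((OFFSET + LENGTH - 1))" "https://data.commoncrawl.org/$WARC_FILENAME" | gunzip -c | head -c 500
```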
|
|
|
|
Should see a WARC record header followed by HTTP response headers and HTML.
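
The same check can be sketched in Go, which is roughly what `warc.go` will need later. This is an illustration under assumptions, not the final fetcher: the function names are hypothetical, there is no retry logic, and it relies on each Common Crawl record being a separate gzip member.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
)

// rangeHeader builds the inclusive HTTP Range value for one WARC record.
func rangeHeader(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// fetchRecord downloads one record by byte range and gunzips it.
// The result starts with the WARC header, followed by the HTTP response.
func fetchRecord(client *http.Client, warcFilename string, offset, length int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, "https://data.commoncrawl.org/"+warcFilename, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rangeHeader(offset, length))
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// A successful range request answers 206 Partial Content.
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	raw, err := io.ReadAll(io.LimitReader(resp.Body, length))
	if err != nil {
		return nil, err
	}
	zr, err := gzip.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}
```

Calling `fetchRecord(http.DefaultClient, warcFilename, offset, length)` with values from a `hosts` row should return bytes beginning with `WARC/1.0`.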
|
|
|
|
**Done when:** We can manually fetch 3-5 WARC records and see valid HTML content.
|
|
|
|
---
|
|
|
|
## Phase 2: WARC Parsing (Stage 2)
|
|
|
|
### Step 2.1: Go Project Setup
|
|
|
|
Set up the shared Go module and the WARC parser binary:
|
|
|
|
```
pipeline/02_warc_parse/
├── main.go      # Entry point, CLI flags, orchestration
├── warc.go      # WARC record fetching (S3 byte-range)
├── parser.go    # HTML parsing (title, link rel=icon, iframe headers)
└── db.go        # Postgres batch read/write
```
|
|
|
|
Dependencies:
|
|
- `github.com/jackc/pgx/v5` — Postgres driver (pool, batch operations)
|
|
- `golang.org/x/net/html` — Lenient HTML parser
|
|
- Standard library `net/http` for S3 byte-range requests
|
|
|
|
CLI flags:
|
|
- `--db` connection string
|
|
- `--batch-size` (default 500)
|
|
- `--concurrency` (default 1000)
|
|
- `--dry-run` (print parsed results, don't write to DB)
|
|
- `--limit` (process at most N rows, for testing)
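
One possible way to wire these flags with the standard `flag` package is sketched below. The `Config` struct, the flag set name, and the convention that a limit of 0 means "no limit" are assumptions made for illustration.

```go
package main

import "flag"

// Config holds the parser's CLI settings. Field names are illustrative.
type Config struct {
	DB          string
	BatchSize   int
	Concurrency int
	DryRun      bool
	Limit       int
}

// parseFlags parses args (without the program name) into a Config.
// Using a dedicated FlagSet keeps the function testable.
func parseFlags(args []string) (Config, error) {
	var cfg Config
	fs := flag.NewFlagSet("warc_parse", flag.ContinueOnError)
	fs.StringVar(&cfg.DB, "db", "", "Postgres connection string")
	fs.IntVar(&cfg.BatchSize, "batch-size", 500, "rows read per batch")
	fs.IntVar(&cfg.Concurrency, "concurrency", 1000, "concurrent fetches")
	fs.BoolVar(&cfg.DryRun, "dry-run", false, "print parsed results, don't write to DB")
	// Assumption: 0 means no limit.
	fs.IntVar(&cfg.Limit, "limit", 0, "process at most N rows")
	err := fs.Parse(args)
	return cfg, err
}
```

In `main`, this would be called as `parseFlags(os.Args[1:])`. The `flag` package accepts both `-limit` and `--limit` spellings.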
|
|
|
|
**Done when:** Project compiles, connects to DB, can read a batch of hosts rows.
|
|
|
|
### Step 2.2: WARC Fetch + Parse Logic
|
|
|
|
Implement:
|
|
1. Byte-range fetch from `https://data.commoncrawl.org/{warc_filename}`
2. Parse WARC record envelope (find the HTTP response within)
3. Extract HTTP response headers:
   - `X-Frame-Options` → if present and not `ALLOWALL`, iframe_allowed = false
   - `Content-Security-Policy` → check for `frame-ancestors` directive
4. Parse HTML body:
   - Extract `<title>` content (first title tag, truncate at 512 chars)
   - Extract all `<link rel="icon">` and `<link rel="shortcut icon">`:
     - href (resolve relative URLs against `{protocol}://{hostname}/`)
     - type attribute (if present)
     - sizes attribute (if present)
   - Ignore data: URIs, ignore links to other domains' icons for now
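
The header rules in step 3 could be expressed as a small pure function, sketched below. The plan does not say how to interpret `frame-ancestors`, so treating anything other than a bare `*` source as "not allowed" is a conservative assumption, and the function name is hypothetical.

```go
package main

import "strings"

// iframeAllowed reports whether the response headers permit framing by a
// third-party page. xfo is the X-Frame-Options value and csp is the
// Content-Security-Policy value; pass "" for an absent header.
func iframeAllowed(xfo, csp string) bool {
	// Any X-Frame-Options value other than ALLOWALL blocks framing.
	if v := strings.TrimSpace(xfo); v != "" && !strings.EqualFold(v, "ALLOWALL") {
		return false
	}
	for _, directive := range strings.Split(csp, ";") {
		fields := strings.Fields(directive)
		if len(fields) == 0 || !strings.EqualFold(fields[0], "frame-ancestors") {
			continue
		}
		// Assumption: only a bare wildcard source counts as "anyone may embed".
		for _, src := range fields[1:] {
			if src == "*" {
				return true
			}
		}
		return false
	}
	return true
}
```

Keeping this free of `net/http` types makes it easy to table-test against header values collected during the dry run.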
|
|
|
|
**Dry-run test:** Run with `--limit 100 --dry-run` and inspect output. Check:
|
|
- Titles look reasonable (not empty, not garbage)
|
|
- Icon URLs are well-formed (absolute, correct protocol)
|
|
- iframe_allowed is set correctly (spot-check against real sites)
|
|
|
|
**Done when:** Can parse 100 WARC records correctly with `--dry-run` showing reasonable output.
|
|
|
|
### Step 2.3: Batch DB Writes + Full 100K Run
|
|
|
|
Implement the database write path:
|
|
1. For each parsed host: UPDATE hosts SET html_title, iframe_allowed, parsed = TRUE
|
|
2. For each host: INSERT INTO icons (host_id, url, source='favicon_ico') for `/favicon.ico`
|
|
3. For each discovered link rel=icon: INSERT INTO icons (host_id, url, source='link_rel', rel_type, rel_sizes)
|
|
4. Use batch/bulk operations (pgx CopyFrom or batch INSERT)
|
|
|
|
Run against the full 100K hosts:
|
|
- Monitor throughput (hosts/sec)
|
|
- Watch for errors (log to stderr)
|
|
|
|
**Validation:**
|
|
- `SELECT COUNT(*) FROM hosts WHERE parsed = TRUE;` should approach 100,000
|
|
- `SELECT COUNT(*) FROM icons;` should be > 100,000 (at minimum one /favicon.ico per host)
|
|
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL;` — expect 90%+
|
|
- `SELECT source, COUNT(*) FROM icons GROUP BY source;` — see the split
|
|
- `SELECT COUNT(*) FROM hosts WHERE iframe_allowed = FALSE;` — expect 30-50%
|
|
- Spot-check: pick some hosts, verify title matches the actual site
|
|
|
|
**Stats:** `stats/02_warc_parse.json`
|
|
|
|
**Done when:** All 100K hosts parsed, icons table populated, stats look reasonable.
|
|
|
|
---
|
|
|
|
## Phase 3: Icon Download (Stage 3)
|
|
|
|
### Step 3.1: Icon Downloader Go Program
|
|
|
|
```
pipeline/03_icon_download/
├── main.go          # Entry point, CLI flags, worker pool
├── downloader.go    # HTTP fetch with timeouts, size limits
├── decoder.go       # Image validation + dimension extraction
├── s3.go            # Upload to everytab-icons bucket
└── db.go            # Claim work, update results
```
|
|
|
|
CLI flags:
|
|
- `--db` connection string
|
|
- `--s3-bucket` (default `everytab-icons`)
|
|
- `--concurrency` (default 1000, tunable)
|
|
- `--batch-size` (default 500)
|
|
- `--timeout` (default 10s)
|
|
- `--max-size` (default 512KB)
|
|
- `--dry-run` (fetch and validate but don't upload to S3 or update DB)
|
|
- `--limit` (process at most N icons)
|
|
|
|
Dependencies:
|
|
- `github.com/jackc/pgx/v5` — Postgres
|
|
- `github.com/aws/aws-sdk-go-v2` — S3 uploads
|
|
- Standard library `image` + sub-packages for decoding dimensions (PNG, GIF, JPEG); the WebP and BMP decoders live in `golang.org/x/image`
|
|
- An ICO parser: find a maintained Go ICO decoder, or write a simple one that reads the ICO header and its directory entries
|
|
|
|
### Step 3.2: Work Claiming + Download Logic
|
|
|
|
Implement:
|
|
1. Claim batch with randomized order (md5 shuffle, FOR UPDATE SKIP LOCKED)
2. For each icon URL:
   - HTTP GET with timeouts (5s dial, 10s total)
   - Read up to max-size bytes, abort if exceeded
   - Validate magic bytes (PNG: `\x89PNG`, GIF: `GIF8`, ICO: `\x00\x00\x01\x00`, etc.)
   - Determine actual content type from magic bytes (don't trust HTTP Content-Type)
   - Decode dimensions:
     - PNG/GIF/JPEG/WebP/BMP: read image header (Go `image.DecodeConfig`)
     - ICO: parse directory entries, find largest at standard size ≤64x64
     - SVG: set width=NULL, height=NULL
   - Compute SHA-256 of full content
   - Check if S3 key exists (HEAD request); if yes, skip upload (dedup)
   - Upload to S3 if new
3. Update icons row with results (or error)
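
The magic-byte check and the content hash from the list above might look like the sketch below. The function names are hypothetical, SVG is left out because it is text-based and needs a different check, and using the bare hex digest as the S3 key is an assumption about the key layout.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
)

// sniffType returns a content type based on leading magic bytes,
// or "" when the format is not recognized.
func sniffType(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")):
		return "image/png"
	case bytes.HasPrefix(b, []byte("GIF87a")), bytes.HasPrefix(b, []byte("GIF89a")):
		return "image/gif"
	case bytes.HasPrefix(b, []byte{0x00, 0x00, 0x01, 0x00}):
		return "image/x-icon"
	case bytes.HasPrefix(b, []byte{0xFF, 0xD8, 0xFF}):
		return "image/jpeg"
	case len(b) >= 12 && string(b[0:4]) == "RIFF" && string(b[8:12]) == "WEBP":
		return "image/webp"
	case bytes.HasPrefix(b, []byte("BM")):
		return "image/bmp"
	}
	return ""
}

// contentKey returns the hex SHA-256 of the content, so identical icons
// served by different hosts map to the same key.
func contentKey(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}
```

Because the key depends only on the bytes, the HEAD-then-upload dedup step stays safe even when two workers race on the same icon.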
|
|
|
|
**Dry-run test:** `--limit 200 --dry-run` — prints what it would do for 200 icons. Check URLs, detected types, dimensions.
|
|
|
|
**Done when:** Can download, validate, and upload icons for a small batch.
|
|
|
|
### Step 3.3: Full 100K Icon Run
|
|
|
|
Run against all icons in the database (likely 150K-300K icon rows for 100K hosts).
|
|
|
|
Monitor:
|
|
- icons/sec throughput
|
|
- Error breakdown (DNS failures, timeouts, HTTP errors, invalid images)
|
|
- S3 dedup hit rate
|
|
- Memory usage (adjust concurrency if needed)
|
|
|
|
**Validation:**
|
|
- `SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;` — expect mostly completed, some failed
|
|
- `SELECT error, COUNT(*) FROM icons WHERE scan_state = 'failed' GROUP BY error ORDER BY count DESC LIMIT 20;` — understand failure modes
|
|
- `aws s3 ls s3://everytab-icons/ | wc -l` — confirm icons in S3
|
|
- Spot-check: download a few icons from S3, open them, verify they're valid images
|
|
|
|
**Stats:** `stats/03_icon_download.json`
|
|
|
|
**Done when:** Icon download complete for 100K dev set, error rate understood, S3 populated.
|
|
|
|
---
|
|
|
|
## Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5)
|
|
|
|
### Step 4.1: Best Icon Selection SQL
|
|
|
|
Write `pipeline/04_best_icon/select.sql`:
|
|
|
|
```sql
UPDATE hosts h SET best_icon_s3_key = sub.s3_key
FROM (
    SELECT DISTINCT ON (i.host_id) i.host_id, i.s3_key
    FROM icons i
    WHERE i.scan_state = 'completed'
    ORDER BY i.host_id,
        CASE
            WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
            WHEN i.width = i.height AND i.width <= 64 THEN 1
            WHEN i.width IS NOT NULL AND i.width <= 64 AND i.height <= 64 THEN 2
            ELSE 3
        END,
        COALESCE(i.width, 0) DESC,
        CASE
            WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
            WHEN i.content_type = 'image/webp' THEN 1
            WHEN i.content_type = 'image/svg+xml' THEN 2
            ELSE 3
        END,
        i.file_size ASC
) sub
WHERE h.id = sub.host_id;
```
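
For spot-checking the selection, the first `CASE` expression can be mirrored in Go as below. This is only a reading aid for the size tiers; the function name and the `hasDims` parameter (false when width and height are NULL, as for SVG) are illustrative.

```go
package main

// sizeTier mirrors the first CASE expression in the query: lower is better.
// hasDims is false when width/height are NULL in the icons table.
func sizeTier(w, h int, hasDims bool) int {
	if !hasDims {
		return 3 // NULL comparisons fall through to ELSE
	}
	square := w == h
	switch {
	case square && (w == 64 || w == 48 || w == 32 || w == 16):
		return 0 // standard square favicon size
	case square && w <= 64:
		return 1 // square but non-standard size
	case w <= 64 && h <= 64:
		return 2 // small but not square
	}
	return 3 // larger than 64px on some side
}
```

For example, a 32x32 icon lands in tier 0, a 24x24 icon in tier 1, a 64x32 icon in tier 2, and a 128x128 icon in tier 3.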
|
|
|
|
**Validation:**
|
|
- `SELECT COUNT(*) FROM hosts WHERE best_icon_s3_key IS NOT NULL;` — expect 60-80% of hosts
|
|
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL AND best_icon_s3_key IS NULL;` — hosts with title but no icon (will still be in bundles)
|
|
- Spot-check: for a few hosts, verify the selected icon is reasonable (correct size, valid)
|
|
|
|
**Stats:** `stats/04_best_icon.json`
|
|
|
|
**Done when:** best_icon_s3_key populated for hosts that have valid icons.
|
|
|
|
### Step 4.2: Bundle Generator Go Program
|
|
|
|
```
pipeline/05_bundle_gen/
├── main.go       # Entry point, CLI flags
├── db.go         # Query hosts + icon keys
├── convert.go    # Icon format conversion → PNG
├── bundle.go     # Chunk + serialize JSON
└── s3.go         # Upload bundles to everytab-site
```
|
|
|
|
CLI flags:
|
|
- `--db` connection string
|
|
- `--icons-bucket` (default `everytab-icons`)
|
|
- `--site-bucket` (default `everytab-site`)
|
|
- `--entries-per-bundle` (tunable, start at 120)
|
|
- `--dry-run` (generate bundles to local disk, don't upload)
|
|
- `--limit` (only process N hosts, for testing)
|
|
|
|
### Step 4.3: Icon Conversion Logic
|
|
|
|
Implement format conversion to PNG:
|
|
1. Download icon from S3 by key
2. Detect format from magic bytes
3. Decode:
   - PNG: decode directly
   - ICO: parse container, extract image at recorded width/height, decode BMP or PNG within
   - GIF/JPEG/BMP/WebP: decode to RGBA
   - SVG: rasterize to 32x32 (use a Go SVG library, or shell out to `rsvg-convert` if simpler)
4. Re-encode as PNG (optimized, don't upscale)
5. Base64-encode
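
The ICO step is the least obvious part of the list above, so here is a minimal sketch of reading the container's directory. It relies on the standard ICO layout (a 6-byte header followed by 16-byte entries, little-endian, with a stored width or height of 0 meaning 256); the type and function names are hypothetical, and decoding the embedded BMP or PNG payload is out of scope.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// ICOEntry describes one image inside an ICO container.
type ICOEntry struct {
	Width, Height int
	Size, Offset  uint32 // payload length and position within the file
}

// parseICODir reads the ICO header and its directory entries.
func parseICODir(b []byte) ([]ICOEntry, error) {
	if len(b) < 6 || binary.LittleEndian.Uint16(b[0:2]) != 0 || binary.LittleEndian.Uint16(b[2:4]) != 1 {
		return nil, errors.New("not an ICO file")
	}
	n := int(binary.LittleEndian.Uint16(b[4:6]))
	if len(b) < 6+16*n {
		return nil, errors.New("truncated ICO directory")
	}
	entries := make([]ICOEntry, 0, n)
	for i := 0; i < n; i++ {
		e := b[6+16*i : 6+16*(i+1)]
		w, h := int(e[0]), int(e[1])
		if w == 0 {
			w = 256 // a stored 0 means 256 pixels
		}
		if h == 0 {
			h = 256
		}
		entries = append(entries, ICOEntry{
			Width:  w,
			Height: h,
			Size:   binary.LittleEndian.Uint32(e[8:12]),
			Offset: binary.LittleEndian.Uint32(e[12:16]),
		})
	}
	return entries, nil
}

// pickEntry returns the largest square entry no bigger than maxSide.
func pickEntry(entries []ICOEntry, maxSide int) (ICOEntry, bool) {
	var best ICOEntry
	found := false
	for _, e := range entries {
		if e.Width != e.Height || e.Width > maxSide {
			continue
		}
		if !found || e.Width > best.Width {
			best, found = e, true
		}
	}
	return best, found
}
```

The payload at `b[Offset : Offset+Size]` is either a PNG (check its magic bytes) or a headerless BMP, which is where the real decoding work starts.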
|
|
|
|
**Test:** Convert 50 icons of mixed formats manually, verify output PNGs look correct.
|
|
|
|
### Step 4.4: Bundle Assembly + Upload
|
|
|
|
Implement:
|
|
1. Query all hosts WHERE html_title IS NOT NULL, randomize (ORDER BY random())
|
|
2. For each host: fetch + convert its icon (or set empty string if no icon)
|
|
3. Assemble entries into chunks of `ENTRIES_PER_BUNDLE`
|
|
4. Serialize each chunk as JSON (`tabs/{n}.json`)
|
|
5. Upload to S3 `everytab-site/tabs/`
|
|
6. Record total bundle count
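
Steps 3 and 4 could be sketched as follows. The JSON field names follow the entry fields listed in this step's dry-run checks; the Go names are illustrative, and serializing each bundle as a top-level JSON array is an assumption, since the bundle envelope is defined in ARCHITECTURE.md rather than here.

```go
package main

import "encoding/json"

// Entry is one tab in a bundle.
type Entry struct {
	Host     string `json:"host"`
	Title    string `json:"title"`
	Icon     string `json:"icon"` // base64 PNG, or "" when the host has no icon
	IconW    int    `json:"icon_w"`
	IconH    int    `json:"icon_h"`
	IframeOK bool   `json:"iframe_ok"`
}

// chunkEntries splits entries into bundles of at most perBundle entries.
// The last bundle may be shorter than the rest.
func chunkEntries(entries []Entry, perBundle int) [][]Entry {
	if perBundle <= 0 {
		return nil
	}
	var bundles [][]Entry
	for start := 0; start < len(entries); start += perBundle {
		end := start + perBundle
		if end > len(entries) {
			end = len(entries)
		}
		bundles = append(bundles, entries[start:end])
	}
	return bundles
}

// encodeBundle serializes one bundle as the body of tabs/{n}.json.
func encodeBundle(bundle []Entry) ([]byte, error) {
	return json.Marshal(bundle)
}
```

With 250 entries and 120 per bundle this yields three bundles (120, 120, 10), and the number of bundles is the value recorded in step 6.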
|
|
|
|
**Dry-run:** Generate bundles to local disk, inspect a few:
|
|
- Valid JSON
|
|
- Icons render in browser (paste a data:image/png;base64,... URI)
|
|
- Entries have host, title, icon, icon_w, icon_h, iframe_ok
|
|
|
|
**Validation:**
|
|
- Bundle files exist in S3
|
|
- `aws s3 ls s3://everytab-site/tabs/ | wc -l` matches expected count
|
|
- Random bundle can be fetched and parsed as JSON
|
|
- Total hosts across all bundles = count of hosts with titles
|
|
|
|
**Stats:** `stats/05_bundle_gen.json`
|
|
|
|
**Done when:** All bundles uploaded to S3, JSON is valid, icons render.
|
|
|
|
---
|
|
|
|
## Phase 5: Frontend (Stage 6)
|
|
|
|
This phase can begin in parallel with Phase 3-4 using mock bundle data.
|
|
|
|
### Step 5.1: Mock Data for Frontend Dev
|
|
|
|
Generate 2-3 small mock bundle files (`tabs/0.json`, `tabs/1.json`, `tabs/2.json`) with ~20 entries each. Use real favicons (Google, GitHub, Wikipedia, etc.) manually base64-encoded. This lets us develop the frontend without waiting for the pipeline.
|
|
|
|
Serve locally with any static file server (`python -m http.server`).
|
|
|
|
**Done when:** Mock bundles exist and can be served locally.
|
|
|
|
### Step 5.2: Basic Tab Rendering
|
|
|
|
Build `frontend/index.html` and `frontend/site.js`:
|
|
|
|
1. HTML: minimal shell with a container div, inline CSS for tab styling
|
|
2. JS: fetch a bundle, render tabs as rows filling the viewport
|
|
3. Tab appearance: mimic Firefox tab shape (rounded top corners, slight border)
|
|
4. Each tab shows favicon (16x16 or 32x32 img from data URI) + truncated title
|
|
5. No-icon tabs show title only
|
|
|
|
Focus: get the visual density right. How many tabs fit across? How many rows fill the viewport? This determines `ENTRIES_PER_BUNDLE`.
|
|
|
|
**Done when:** Page renders tabs from a mock bundle. Visually looks like a page full of browser tabs.
|
|
|
|
### Step 5.3: Marquee Animation
|
|
|
|
Add horizontal marquee to each row:
|
|
- CSS `@keyframes` animation, translateX
|
|
- Each row at slightly different speed and direction (some left, some right)
|
|
- Smooth, subtle movement — not distracting, just enough to feel alive
|
|
- Rows need extra tabs beyond viewport width to avoid gaps during scroll
|
|
|
|
**Done when:** Rows scroll smoothly, no visual glitches at edges.
|
|
|
|
### Step 5.4: Interaction — Click, Iframe, Close
|
|
|
|
Implement tab click behavior:
|
|
1. If `iframe_ok`: show an overlay with iframe loading the site (`{protocol}://{hostname}`)
|
|
2. If `!iframe_ok`: open in new tab (`target="_blank"`, add rel="noopener")
|
|
3. Visual indicator on tabs that will open externally (small icon/badge)
|
|
4. Close overlay: X button + click-outside + Escape key
|
|
|
|
**Done when:** Clicking tabs works correctly for both iframe and external cases.
|
|
|
|
### Step 5.5: Infinite Scroll + Random Bundle Loading
|
|
|
|
Implement:
|
|
1. Seeded PRNG using `Date.now()` — generates deterministic sequence of bundle indices
|
|
2. On page load: fetch first bundle, render
|
|
3. Scroll detection: when user approaches bottom, fetch next random bundle
|
|
4. Track loaded bundle IDs in a Set (no duplicates)
|
|
5. Append new rows below existing ones
|
|
6. Handle edge case: all bundles loaded (unlikely with 50K+ bundles but handle gracefully)
|
|
|
|
`TOTAL_BUNDLES` is a constant baked into the JS at build time.
|
|
|
|
**Done when:** Infinite scroll works, new bundles load seamlessly, no duplicate bundles.
|
|
|
|
### Step 5.6: Frontend Build Script
|
|
|
|
Write `pipeline/06_frontend/build.sh`:
|
|
1. Read total bundle count (from pipeline output or S3)
|
|
2. Inject `const TOTAL_BUNDLES = {M};` into site.js
|
|
3. Copy index.html + site.js to S3 `everytab-site/`
|
|
4. Invalidate CloudFront (if distribution exists)
|
|
|
|
**Done when:** Build script produces deployable frontend with correct bundle count.
|
|
|
|
---
|
|
|
|
## Phase 6: Integration & End-to-End Test (100K)
|
|
|
|
### Step 6.1: Run Full Pipeline (100K)
|
|
|
|
Execute all stages in sequence on EC2:
|
|
1. Verify hosts table has 100K entries (from Phase 1)
|
|
2. Run WARC parser (Phase 2) — should complete in minutes
|
|
3. Run icon downloader (Phase 3) — should complete in 10-30 minutes at 100K scale
|
|
4. Run best icon selection (Phase 4.1)
|
|
5. Run bundle generator (Phase 4.2-4.4)
|
|
6. Run frontend build (Phase 5.6)
|
|
|
|
**Validation:** Visit the CloudFront URL. The site should work:
|
|
- Tabs render with real favicons and titles
|
|
- Clicking works (iframe + external)
|
|
- Scrolling loads more tabs
|
|
- No JS console errors
|
|
|
|
### Step 6.2: Tune Parameters
|
|
|
|
Based on the 100K run:
|
|
- **ENTRIES_PER_BUNDLE:** Look at the live site. Does one bundle fill the screen? Too many tabs? Too few? Adjust.
|
|
- **Concurrency:** Was the icon download memory-stable? CPU-bound or network-bound? Adjust goroutine pool size.
|
|
- **Timeouts:** What was the error distribution? Are timeouts too aggressive? Too lenient?
|
|
- **Icon selection:** Do the selected icons look good? Any weird sizes or broken images?
|
|
|
|
Update CLI flag defaults based on findings.
|
|
|
|
### Step 6.3: Collect & Review Stats
|
|
|
|
Merge all `stats/*.json` into a single pipeline report. Review:
|
|
- Loss at each stage (domains → parsed → icons downloaded → icons selected → bundled)
|
|
- Time per stage
|
|
- Error patterns (are certain TLDs failing more? certain icon formats?)
|
|
- Storage usage (S3 icons bucket, S3 site bucket)
|
|
|
|
Identify any pipeline bugs or data quality issues. Fix before scaling up.
|
|
|
|
**Done when:** End-to-end works at 100K, parameters tuned, stats reviewed, bugs fixed.
|
|
|
|
---
|
|
|
|
## Phase 7: Full-Scale Run (30M)
|
|
|
|
### Step 7.1: Remove Limits, Re-run CC-Index Query
|
|
|
|
Update the DuckDB query to remove `LIMIT 100000`. Re-run.
|
|
|
|
Considerations:
|
|
- If httpfs takes >1hr, switch to downloading the parquet files first
|
|
- May need to increase RDS storage (30M rows with WARC paths ≈ 5-10GB)
|
|
- Monitor DuckDB memory usage
|
|
|
|
**Validation:** `SELECT COUNT(*) FROM hosts;` shows ~30M rows.
|
|
|
|
### Step 7.2: Run WARC Parser at Scale
|
|
|
|
Run with full concurrency against 30M hosts. Expected time: 2-6 hours.
|
|
|
|
Monitor:
|
|
- Throughput (hosts/sec)
|
|
- Error rate stability (should plateau, not climb)
|
|
- Postgres connection pool health
|
|
- Memory usage
|
|
|
|
### Step 7.3: Run Icon Downloader at Scale
|
|
|
|
This is the long pole — expected 12-48 hours.
|
|
|
|
Monitor continuously:
|
|
- icons/sec rate
|
|
- DNS cache hit rate (check Unbound stats: `unbound-control stats`)
|
|
- S3 upload rate
|
|
- Error rate by type
|
|
- Completion percentage
|
|
|
|
If too slow (projected >48hrs):
|
|
- Consider increasing concurrency (if memory allows)
|
|
- Consider spinning up fleet (add more EC2 instances running the same binary)
|
|
- Check if DNS is the bottleneck (Unbound stats)
|
|
- Check if S3 uploads are the bottleneck (batch or reduce HEAD checks)
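
To decide whether the run is on track for the 48-hour budget, the remaining time can be projected from the current rate. The helper below is a minimal sketch with a hypothetical name; the inputs would come from the `scan_state` counts and the observed icons/sec.

```go
package main

// projectedHours estimates the hours remaining at the current rate.
// It returns -1 when the rate is not positive (no estimate possible).
func projectedHours(total, done int64, perSecond float64) float64 {
	if perSecond <= 0 {
		return -1
	}
	remaining := total - done
	if remaining < 0 {
		remaining = 0
	}
	return float64(remaining) / perSecond / 3600
}
```

As an illustrative figure only: 60M icon rows at a steady 500 icons/sec comes to about 33 hours, while finishing the same workload inside 48 hours needs roughly 350 icons/sec sustained.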
|
|
|
|
### Step 7.4: Best Icon Selection + Bundle Generation
|
|
|
|
Run at full scale. Expected: 1-2 hours total.
|
|
|
|
Monitor bundle sizes — verify they're in the expected range with `ENTRIES_PER_BUNDLE` from tuning.
|
|
|
|
### Step 7.5: Rebuild Frontend + Deploy
|
|
|
|
Run frontend build with the real bundle count. Invalidate CloudFront.
|
|
|
|
**Validation:** Visit the live site. Browse around. Check:
|
|
- Tab variety (seeing diverse sites, not just one TLD)
|
|
- Icon quality (no broken images, reasonable sizes)
|
|
- Performance (bundles load quickly, no jank)
|
|
- Stats page / stats.json looks correct
|
|
|
|
**Done when:** Full-scale site is live and working.
|
|
|
|
---
|
|
|
|
## Phase 8: Backup & Teardown
|
|
|
|
### Step 8.1: Backup RDS to Homelab
|
|
|
|
```bash
# On EC2 (fast connection to RDS):
pg_dump -Fc "$DATABASE_URL" > everytab_dump.pgfc

# Transfer to homelab (from EC2 or direct):
scp everytab_dump.pgfc homelab:/backups/everytab/

# On homelab, verify restore (pg_restore -d needs the target database to exist;
# --no-owner avoids errors for RDS roles that are missing locally):
createdb everytab_local
pg_restore --no-owner -d everytab_local everytab_dump.pgfc
psql everytab_local -c "SELECT COUNT(*) FROM hosts; SELECT COUNT(*) FROM icons;"
```
|
|
|
|
### Step 8.2: Backup Icons S3 to Homelab
|
|
|
|
```bash
# From homelab (or EC2 as intermediary):
aws s3 sync s3://everytab-icons/ /backups/everytab/icons/

# Verify file count matches (count recursively in case keys contain prefixes):
find /backups/everytab/icons/ -type f | wc -l
# Compare with: aws s3 ls s3://everytab-icons/ --recursive | wc -l
```
|
|
|
|
### Step 8.3: Verify & Teardown
|
|
|
|
After confirming backups:
|
|
|
|
```bash
# Verify the live site still works (it only depends on everytab-site + CloudFront)
curl -s https://your-cloudfront-domain.net/ | head

# Teardown scanning infrastructure:
aws rds delete-db-instance --db-instance-identifier everytab --skip-final-snapshot
aws s3 rb s3://everytab-icons --force
aws ec2 terminate-instances --instance-ids i-xxxxx
```
|
|
|
|
**Done when:** Only `everytab-site` S3 bucket + CloudFront remain running. Monthly cost: ~$2-4.
|
|
|
|
---
|
|
|
|
## Development Notes
|
|
|
|
### What Can Be Parallelized
|
|
|
|
- **Frontend dev (Phase 5.1-5.5)** can happen at any time using mock data
|
|
- **AWS infra setup (Phase 0.2)** can happen while writing code locally
|
|
- **Icon downloader (Phase 3)** and **bundle generator (Phase 4)** are independent codebases, can be written in parallel
|
|
|
|
### Testing Strategy
|
|
|
|
- **Dry-run flags** on all Go programs: print what would happen without mutating DB/S3
|
|
- **--limit flags** on all Go programs: process a small subset quickly
|
|
- **Spot-checks:** after each stage, manually verify 5-10 random entries
|
|
- **Stats files:** compare counts between stages to catch data loss
|
|
- **100K dev set:** full pipeline at small scale before committing to a 24hr+ full run
|
|
|
|
### Common Pitfalls to Watch For
|
|
|
|
- **DuckDB CC-Index path:** The exact S3 path to parquet files changes per crawl. Check Common Crawl's website for the latest crawl ID and index location.
|
|
- **WARC record format:** WARC records have a specific envelope format (WARC/1.0 header, blank line, HTTP response). Don't assume the HTTP response starts at byte 0.
|
|
- **Relative icon URLs:** `/favicon.ico` is relative to root, but `favicon.ico` (no leading slash) is relative to the page path. Since we only have root pages (`/`), both resolve the same. But `../icons/fav.png` could be tricky — handle gracefully or skip.
|
|
- **ICO files are complex:** The ICO container format can embed BMP (with a modified header) or PNG. Many "ICO" files are actually just PNGs renamed to .ico. Check magic bytes, not file extension.
|
|
- **SVG rasterization:** Go doesn't have great native SVG support. Consider shelling out to `rsvg-convert` (the CLI that ships with librsvg), or evaluate a pure-Go rasterizer such as `github.com/srwiley/oksvg`. This can be a follow-up if SVG icons are rare.
|
|
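- **Postgres connection limits:** RDS derives the default `max_connections` from instance memory (`LEAST(DBInstanceClassMemory/9531392, 5000)`), which works out to a few hundred on a db.t3.medium; confirm the real value with `SHOW max_connections;`. With 1000 goroutines, we still need connection pooling (pgx pool handles this). Set pool max to ~40 connections.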
|
|
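- **S3 dedup races:** S3 has provided strong read-after-write consistency since December 2020, so a HEAD issued after a completed upload will find the object. Two workers can still both HEAD the same key before either has uploaded it. That is harmless: the key is the content hash, so a duplicate upload is idempotent.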
|
|
- **CloudFront caching:** After deploying new bundles, invalidate `/*` or set short TTL during development. For production, use long TTLs (bundles are immutable between crawls).
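
For the relative icon URL pitfall above, the standard `net/url` package already handles the awkward cases, including `../` segments that would climb above the root. The sketch below resolves an href against `{protocol}://{hostname}/` and drops data: URIs and icons on other hosts, as Step 2.2 specifies; the function name and the exact-hostname comparison are assumptions.

```go
package main

import (
	"net/url"
	"strings"
)

// resolveIconURL resolves href against the host's root page and reports
// whether the resulting URL should be kept.
func resolveIconURL(protocol, hostname, href string) (string, bool) {
	href = strings.TrimSpace(href)
	if href == "" || strings.HasPrefix(strings.ToLower(href), "data:") {
		return "", false
	}
	base, err := url.Parse(protocol + "://" + hostname + "/")
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	// ResolveReference applies RFC 3986 resolution, including dot segments.
	abs := base.ResolveReference(ref)
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false
	}
	// Assumption: "other domain" means any hostname that differs from the page's.
	if !strings.EqualFold(abs.Hostname(), hostname) {
		return "", false
	}
	return abs.String(), true
}
```

With this approach `favicon.ico`, `/favicon.ico`, and `../icons/fav.png` all resolve cleanly from the root page, so no special-casing is needed in the parser.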
|