EveryTab Implementation Plan
This plan builds the system described in ARCHITECTURE.md in incremental steps. We start with 100K hosts to validate the pipeline end-to-end, then scale to the full ~30M.
Each step has a clear deliverable and validation criteria. Steps within a phase are sequential; some phases can overlap (noted where applicable).
Phase 0: Project Setup & AWS Infrastructure
Step 0.1: Repository Structure
Create the project layout:
everytab/
├── design.md
├── ARCHITECTURE.md
├── PLAN.md
├── infra/ # AWS CLI scripts for setup/teardown
│ ├── setup.sh # Create RDS, S3 buckets, security groups
│ ├── teardown.sh # Delete non-permanent resources
│ └── ec2-userdata.sh # EC2 bootstrap (install Go, DuckDB, Unbound)
├── pipeline/
│ ├── 01_cc_index/ # DuckDB query scripts
│ ├── 02_warc_parse/ # Go program
│ ├── 03_icon_download/# Go program
│ ├── 04_best_icon/ # SQL script
│ ├── 05_bundle_gen/ # Go program
│ └── 06_frontend/ # Build script, templates
├── frontend/
│ ├── index.html
│ └── site.js
├── stats/ # Stats output from each stage (gitignored)
└── go.mod # Shared Go module for pipeline programs
Done when: Repo structure exists, go.mod initialized, .gitignore covers stats/ and any local config.
Step 0.2: AWS Infrastructure (Manual CLI)
Create resources using AWS CLI commands in infra/setup.sh:
- S3 buckets:
  - everytab-icons (private, no public access)
  - everytab-site (private, accessed via CloudFront OAC)
- RDS Postgres:
  - db.t3.medium, 20GB storage (expandable), Postgres 16
  - In a VPC, security group allows inbound 5432 from EC2 security group
  - No public access (EC2 connects within VPC)
  - No multi-AZ (dev, not production)
  - Set a strong password, store in a local .env (gitignored)
- EC2 instance:
  - c5.xlarge (4 vCPU, 8GB RAM): enough for Go concurrency + Unbound cache
  - Amazon Linux 2023 or Ubuntu 24.04
  - Security group: allow SSH (from your IP), allow outbound all
  - Same VPC/subnet as RDS
  - Key pair for SSH access
- CloudFront distribution:
  - Origin: everytab-site S3 bucket (OAC)
  - Default cache behavior: cache everything, Brotli+Gzip compression
  - Can set up now or defer to Phase 2
- IAM role for EC2:
  - S3 read/write to both buckets
  - Attach as instance profile
Validation: SSH into EC2, confirm psql can connect to RDS, confirm aws s3 ls shows both buckets.
Done when: All resources exist, EC2 can reach RDS and S3.
Step 0.3: EC2 Environment Setup
Bootstrap script (infra/ec2-userdata.sh or run manually):
- Install Go (latest stable, 1.22+)
- Install DuckDB CLI
- Install Unbound, configure as recursive resolver:
  - /etc/unbound/unbound.conf: recursive mode, no forwarding, listen on 127.0.0.1
  - High cache: msg-cache-size: 512m, rrset-cache-size: 1g
  - cache-min-ttl: 3600
  - prefetch: yes
  - num-threads: 4
- Set /etc/resolv.conf → nameserver 127.0.0.1
- Install psql client, pg_dump
- Confirm DuckDB httpfs extension works: INSTALL httpfs; LOAD httpfs;
Validation:
- go version works
- duckdb -c "INSTALL httpfs; LOAD httpfs; SELECT 1;" works
- dig example.com @127.0.0.1 resolves (Unbound working)
- psql $DATABASE_URL -c "SELECT 1;" connects to RDS
Done when: EC2 is a working development environment for all pipeline stages.
Phase 1: CC-Index Query (Stage 1)
Step 1.1: Database Schema
Create the Postgres tables. Run via psql:
CREATE TABLE hosts (
    id SERIAL PRIMARY KEY,
    hostname TEXT NOT NULL UNIQUE,
    protocol TEXT NOT NULL,
    crawl_id TEXT NOT NULL,
    warc_filename TEXT NOT NULL,
    warc_record_offset BIGINT NOT NULL,
    warc_record_length INT NOT NULL,
    html_title TEXT,
    iframe_allowed BOOLEAN,
    best_icon_s3_key TEXT,
    parsed BOOLEAN DEFAULT FALSE
);
CREATE TABLE icons (
    id SERIAL PRIMARY KEY,
    host_id INT NOT NULL REFERENCES hosts(id),
    url TEXT NOT NULL,
    source TEXT NOT NULL,
    rel_type TEXT,
    rel_sizes TEXT,
    content_type TEXT,
    width INT,
    height INT,
    file_size INT,
    s3_key TEXT,
    scan_state TEXT DEFAULT 'unscanned',
    error TEXT
);
CREATE INDEX idx_hosts_parsed ON hosts(id) WHERE parsed = FALSE;
CREATE INDEX idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned';
CREATE INDEX idx_icons_host_id ON icons(host_id);
Done when: Tables exist in RDS, schema matches ARCHITECTURE.md.
Step 1.2: DuckDB CC-Index Query (100K limit)
Write pipeline/01_cc_index/query.sql (or a shell script wrapping DuckDB CLI).
The script:
- Connects DuckDB to RDS via the postgres extension
- Queries the CC-Index parquet files via httpfs (latest crawl)
- Filters per ARCHITECTURE.md criteria
- Deduplicates per hostname (prefer https)
- Limits to 100,000 rows for dev
- Inserts directly into the hosts table
Key considerations:
- Find the latest crawl's columnar index path. The Parquet index is published under s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2026-05/subset=warc/*.parquet; the cc-index/collections/{crawl}/indexes/ prefix holds the gzipped CDX files, not Parquet. Verify the crawl ID and path before running
- DuckDB postgres extension: INSTALL postgres; LOAD postgres; ATTACH 'dbname=... host=... ...' AS pg (TYPE POSTGRES);
- The dedup logic: partition by hostname, order by protocol (https first), take first row
- Add LIMIT 100000 for dev, remove for full run
- Time the query: if httpfs takes >1hr, switch to downloading parquet first
Validation:
- SELECT COUNT(*) FROM hosts; returns ~100,000
- SELECT protocol, COUNT(*) FROM hosts GROUP BY protocol; shows mostly https
- SELECT * FROM hosts LIMIT 5; shows sane data (real hostnames, valid WARC paths)
- Spot-check: pick a few hostnames, verify they're real websites
Stats to emit: stats/01_cc_index.json with total_domains, https_count, http_count, query_time_seconds.
Done when: 100K hosts in the database with valid WARC coordinates.
Step 1.3: Validate WARC Coordinates
Quick sanity check — before writing the full WARC parser, confirm we can actually fetch WARC records:
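# Pick a random row
psql "$DATABASE_URL" -c "SELECT warc_filename, warc_record_offset, warc_record_length FROM hosts ORDER BY random() LIMIT 1;"
# Set WARC_FILENAME, OFFSET, and LENGTH from that row, then fetch it with a curl byte range.
# Each WARC record is its own gzip member, so decompress before inspecting.
curl -s -r "$OFFSET-$((OFFSET + LENGTH - 1))" "https://data.commoncrawl.org/$WARC_FILENAME" | gunzip | head -c 500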
Should see a WARC record header followed by HTTP response headers and HTML.
Done when: We can manually fetch 3-5 WARC records and see valid HTML content.
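The same check can be sketched in Go as a preview of what warc.go will need. This is a minimal sketch under stated assumptions: the warcRecord type, the parseWARCRecord function, and the synthetic sampleRecord helper are illustrative names, not taken from ARCHITECTURE.md, and real records may carry headers (for example a stale Content-Length) that need more lenient handling than http.ReadResponse provides.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/textproto"
	"strings"
)

// warcRecord holds the parts of a response record the parser cares about.
// Type and field names are illustrative.
type warcRecord struct {
	Type   string      // value of the WARC-Type header, e.g. "response"
	Status int         // HTTP status code of the archived response
	Header http.Header // archived HTTP response headers
	Body   []byte      // archived payload (HTML for our rows)
}

// parseWARCRecord gunzips the bytes of one byte-range fetch, skips the WARC
// envelope, and parses the embedded HTTP response.
func parseWARCRecord(raw []byte) (*warcRecord, error) {
	zr, err := gzip.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, fmt.Errorf("gunzip: %w", err)
	}
	defer zr.Close()

	br := bufio.NewReader(zr)
	tp := textproto.NewReader(br)

	// Envelope: a version line, then headers, then a blank line.
	version, err := tp.ReadLine()
	if err != nil {
		return nil, fmt.Errorf("read version line: %w", err)
	}
	if !strings.HasPrefix(version, "WARC/") {
		return nil, fmt.Errorf("not a WARC record: %q", version)
	}
	warcHdr, err := tp.ReadMIMEHeader()
	if err != nil {
		return nil, fmt.Errorf("read WARC headers: %w", err)
	}

	// The HTTP response starts right after the envelope's blank line.
	resp, err := http.ReadResponse(br, nil)
	if err != nil {
		return nil, fmt.Errorf("read HTTP response: %w", err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("read body: %w", err)
	}
	return &warcRecord{
		Type:   warcHdr.Get("WARC-Type"),
		Status: resp.StatusCode,
		Header: resp.Header,
		Body:   body,
	}, nil
}

// sampleRecord builds a tiny synthetic gzipped record for the demo.
func sampleRecord() []byte {
	html := "<html><title>Hi</title></html>"
	httpPart := fmt.Sprintf("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: %d\r\n\r\n%s", len(html), html)
	rec := fmt.Sprintf("WARC/1.0\r\nWARC-Type: response\r\nContent-Length: %d\r\n\r\n%s\r\n\r\n", len(httpPart), httpPart)
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(rec))
	zw.Close()
	return buf.Bytes()
}

func main() {
	rec, err := parseWARCRecord(sampleRecord())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rec.Type, rec.Status, rec.Header.Get("Content-Type"), len(rec.Body))
}
```

In the real parser the raw bytes would come from the byte-range GET rather than sampleRecord.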
Phase 2: WARC Parsing (Stage 2)
Step 2.1: Go Project Setup
Set up the shared Go module and the WARC parser binary:
pipeline/02_warc_parse/
├── main.go # Entry point, CLI flags, orchestration
├── warc.go # WARC record fetching (S3 byte-range)
├── parser.go # HTML parsing (title, link rel=icon, iframe headers)
└── db.go # Postgres batch read/write
Dependencies:
- github.com/jackc/pgx/v5: Postgres driver (pool, batch operations)
- golang.org/x/net/html: lenient HTML parser
- Standard library net/http for S3 byte-range requests
CLI flags:
- --db connection string
- --batch-size (default 500)
- --concurrency (default 1000)
- --dry-run (print parsed results, don't write to DB)
- --limit (process at most N rows, for testing)
Done when: Project compiles, connects to DB, can read a batch of hosts rows.
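The CLI flags above map directly onto the standard library flag package. The sketch below is one possible shape for main.go, not a settled design: the config struct, the parseFlags function, and the rule that --db is required are assumptions made for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// config mirrors the CLI flags listed for the WARC parser.
type config struct {
	DB          string
	BatchSize   int
	Concurrency int
	DryRun      bool
	Limit       int
}

// parseFlags parses args (without the program name) into a config.
// Using a FlagSet instead of the global flag state keeps it testable.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("warc_parse", flag.ContinueOnError)
	fs.StringVar(&c.DB, "db", "", "Postgres connection string")
	fs.IntVar(&c.BatchSize, "batch-size", 500, "hosts read per batch")
	fs.IntVar(&c.Concurrency, "concurrency", 1000, "concurrent WARC fetches")
	fs.BoolVar(&c.DryRun, "dry-run", false, "print parsed results, don't write to DB")
	fs.IntVar(&c.Limit, "limit", 0, "process at most N rows (0 means no limit)")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	// Assumption: even a dry run reads hosts from the DB, so --db is mandatory.
	if c.DB == "" {
		return c, errors.New("--db is required")
	}
	return c, nil
}

func main() {
	args := os.Args[1:]
	if len(args) == 0 {
		// Demo arguments so the sketch prints something when run bare.
		args = []string{"--db", "postgres://localhost/everytab", "--limit", "100", "--dry-run"}
	}
	cfg, err := parseFlags(args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```

The flag package accepts both -db and --db spellings, so the double-dash form used in this plan works unchanged.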
Step 2.2: WARC Fetch + Parse Logic
Implement:
- Byte-range fetch from https://data.commoncrawl.org/{warc_filename}
- Gunzip the fetched bytes (each WARC record is stored as its own gzip member)
- Parse WARC record envelope (find the HTTP response within)
- Extract HTTP response headers:
  - X-Frame-Options → if present and not ALLOWALL, iframe_allowed = false
  - Content-Security-Policy → check for frame-ancestors directive
- Parse HTML body:
  - Extract <title> content (first title tag, truncate at 512 chars)
  - Extract all <link rel="icon"> and <link rel="shortcut icon">:
    - href (resolve relative URLs against {protocol}://{hostname}/)
    - type attribute (if present)
    - sizes attribute (if present)
  - Ignore data: URIs, ignore links to other domains' icons for now
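The header checks could look roughly like the sketch below. The plan only says to check for frame-ancestors; treating anything other than a literal * source as blocking is an assumption made here to stay conservative, and the function names (iframeAllowed, containsWildcard, hdr) are illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// iframeAllowed reports whether the archived response headers appear to
// permit framing by an arbitrary third-party page.
func iframeAllowed(h http.Header) bool {
	// Any X-Frame-Options value other than the non-standard ALLOWALL blocks framing.
	if xfo := strings.TrimSpace(h.Get("X-Frame-Options")); xfo != "" && !strings.EqualFold(xfo, "ALLOWALL") {
		return false
	}
	for _, value := range h.Values("Content-Security-Policy") {
		// One header value may carry several comma-separated policies.
		for _, policy := range strings.Split(value, ",") {
			for _, directive := range strings.Split(policy, ";") {
				fields := strings.Fields(directive)
				if len(fields) == 0 || !strings.EqualFold(fields[0], "frame-ancestors") {
					continue
				}
				// Assumption: only the wildcard source lets an unknown embedder in.
				if !containsWildcard(fields[1:]) {
					return false
				}
			}
		}
	}
	return true
}

// containsWildcard reports whether the source list includes "*".
func containsWildcard(sources []string) bool {
	for _, s := range sources {
		if s == "*" {
			return true
		}
	}
	return false
}

// hdr builds an http.Header from key/value pairs, for the demo.
func hdr(kv ...string) http.Header {
	h := http.Header{}
	for i := 0; i+1 < len(kv); i += 2 {
		h.Add(kv[i], kv[i+1])
	}
	return h
}

func main() {
	fmt.Println(iframeAllowed(hdr()))
	fmt.Println(iframeAllowed(hdr("X-Frame-Options", "SAMEORIGIN")))
	fmt.Println(iframeAllowed(hdr("Content-Security-Policy", "default-src 'self'; frame-ancestors 'none'")))
}
```

Erring toward iframe_allowed = false is the safer default: a wrongly blocked site opens in a new tab, while a wrongly allowed one shows a blank overlay.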
Dry-run test: Run with --limit 100 --dry-run and inspect output. Check:
- Titles look reasonable (not empty, not garbage)
- Icon URLs are well-formed (absolute, correct protocol)
- iframe_allowed is set correctly (spot-check against real sites)
Done when: Can parse 100 WARC records correctly with --dry-run showing reasonable output.
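For the icon href handling in Step 2.2, net/url already implements relative resolution, including dot segments such as ../icons/fav.png. The following is a sketch only: resolveIconURL is a hypothetical helper name, and dropping fragments and non-HTTP schemes is an assumption rather than something ARCHITECTURE.md specifies.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveIconURL resolves href against {protocol}://{hostname}/ and reports
// whether the result is an icon URL we want to keep.
func resolveIconURL(protocol, hostname, href string) (string, bool) {
	href = strings.TrimSpace(href)
	if href == "" || strings.HasPrefix(strings.ToLower(href), "data:") {
		return "", false // empty or inline data: URI
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	base := &url.URL{Scheme: protocol, Host: hostname, Path: "/"}
	abs := base.ResolveReference(ref)
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false
	}
	// Icons hosted on other domains are skipped for now.
	if !strings.EqualFold(abs.Hostname(), hostname) {
		return "", false
	}
	abs.Fragment = ""
	return abs.String(), true
}

func main() {
	for _, href := range []string{"/favicon.ico", "favicon.ico", "../icons/fav.png", "//cdn.example.net/x.png"} {
		u, ok := resolveIconURL("https", "example.com", href)
		fmt.Printf("%-26s ok=%v %s\n", href, ok, u)
	}
}
```

Because every page in the hosts table is a root page, leading ../ segments collapse to the site root instead of producing an error.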
Step 2.3: Batch DB Writes + Full 100K Run
Implement the database write path:
- For each parsed host: UPDATE hosts SET html_title, iframe_allowed, parsed = TRUE
- For each host: INSERT INTO icons (host_id, url, source='favicon_ico') for /favicon.ico
- For each discovered link rel=icon: INSERT INTO icons (host_id, url, source='link_rel', rel_type, rel_sizes)
- Use batch/bulk operations (pgx CopyFrom or batch INSERT)
Run against the full 100K hosts:
- Monitor throughput (hosts/sec)
- Watch for errors (log to stderr)
Validation:
- SELECT COUNT(*) FROM hosts WHERE parsed = TRUE; should approach 100,000
- SELECT COUNT(*) FROM icons; should be > 100,000 (at minimum one /favicon.ico per host)
- SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL; (expect 90%+)
- SELECT source, COUNT(*) FROM icons GROUP BY source; (see the split)
- SELECT COUNT(*) FROM hosts WHERE iframe_allowed = FALSE; (expect 30-50%)
- Spot-check: pick some hosts, verify title matches the actual site
Stats: stats/02_warc_parse.json
Done when: All 100K hosts parsed, icons table populated, stats look reasonable.
Phase 3: Icon Download (Stage 3)
Step 3.1: Icon Downloader Go Program
pipeline/03_icon_download/
├── main.go # Entry point, CLI flags, worker pool
├── downloader.go # HTTP fetch with timeouts, size limits
├── decoder.go # Image validation + dimension extraction
├── s3.go # Upload to everytab-icons bucket
└── db.go # Claim work, update results
CLI flags:
- --db connection string
- --s3-bucket (default everytab-icons)
- --concurrency (default 1000, tunable)
- --batch-size (default 500)
- --timeout (default 10s)
- --max-size (default 512KB)
- --dry-run (fetch and validate but don't upload to S3 or update DB)
- --limit (process at most N icons)
Dependencies:
- github.com/jackc/pgx/v5: Postgres
- github.com/aws/aws-sdk-go-v2: S3 uploads
- Standard library image + sub-packages for decoding PNG/GIF/JPEG dimensions; golang.org/x/image/webp and golang.org/x/image/bmp for WebP and BMP
- An ICO parser: find a proper ICO decoder, or write a simple one that reads the ICO header for directory entries
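If we write the simple ICO reader ourselves, the directory is small enough to parse by hand: a 6-byte header (reserved, type = 1, image count) followed by one 16-byte entry per image, all little-endian. The sketch below covers only the directory, not decoding the embedded BMP or PNG; icoEntry, parseICODir, pickEntry, and sampleDir are illustrative names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// icoEntry is one image described by the ICO directory.
type icoEntry struct {
	Width, Height int    // pixels; a stored 0 means 256
	Size, Offset  uint32 // byte length and file offset of the image data
}

// parseICODir reads the ICO header and its directory entries.
func parseICODir(data []byte) ([]icoEntry, error) {
	if len(data) < 6 ||
		binary.LittleEndian.Uint16(data[0:2]) != 0 || // reserved
		binary.LittleEndian.Uint16(data[2:4]) != 1 { // 1 = icon
		return nil, errors.New("not an ICO file")
	}
	n := int(binary.LittleEndian.Uint16(data[4:6]))
	if n == 0 || len(data) < 6+16*n {
		return nil, errors.New("empty or truncated ICO directory")
	}
	entries := make([]icoEntry, 0, n)
	for i := 0; i < n; i++ {
		e := data[6+16*i : 6+16*(i+1)]
		w, h := int(e[0]), int(e[1])
		if w == 0 {
			w = 256
		}
		if h == 0 {
			h = 256
		}
		entries = append(entries, icoEntry{
			Width:  w,
			Height: h,
			Size:   binary.LittleEndian.Uint32(e[8:12]),
			Offset: binary.LittleEndian.Uint32(e[12:16]),
		})
	}
	return entries, nil
}

// pickEntry returns the largest square entry at a standard size up to 64x64.
func pickEntry(entries []icoEntry) (icoEntry, bool) {
	var best icoEntry
	found := false
	for _, e := range entries {
		standard := e.Width == e.Height &&
			(e.Width == 16 || e.Width == 32 || e.Width == 48 || e.Width == 64)
		if standard && (!found || e.Width > best.Width) {
			best, found = e, true
		}
	}
	return best, found
}

// sampleDir builds a directory-only ICO for the demo (no image data).
func sampleDir(sizes ...byte) []byte {
	buf := []byte{0, 0, 1, 0, byte(len(sizes)), 0}
	for _, s := range sizes {
		entry := make([]byte, 16)
		entry[0], entry[1] = s, s
		buf = append(buf, entry...)
	}
	return buf
}

func main() {
	entries, err := parseICODir(sampleDir(16, 32, 0))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	best, ok := pickEntry(entries)
	fmt.Println(len(entries), best.Width, best.Height, ok)
}
```

A file that is really a PNG renamed to .ico fails the header check here, which is the cue to fall back to magic-byte detection.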
Step 3.2: Work Claiming + Download Logic
Implement:
- Claim batch with randomized order (md5 shuffle, FOR UPDATE SKIP LOCKED)
- For each icon URL:
- HTTP GET with timeouts (5s dial, 10s total)
- Read up to max-size bytes, abort if exceeded
- Validate magic bytes (PNG: \x89PNG, GIF: GIF8, ICO: \x00\x00\x01\x00, etc.)
- Determine actual content type from magic bytes (don't trust HTTP Content-Type)
- Decode dimensions:
  - PNG/GIF/JPEG/WebP/BMP: read image header (Go image.DecodeConfig; WebP and BMP need the golang.org/x/image decoders registered)
  - ICO: parse directory entries, find largest at standard size ≤64x64
  - SVG: set width=NULL, height=NULL
- Compute SHA-256 of full content
- Check if S3 key exists (HEAD request); if yes, skip upload (dedup)
- Upload to S3 if new
- Update icons row with results (or error)
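The magic-byte check and the content hash from the list above can be sketched as below. The function names are illustrative, and using the bare hex digest as the S3 key is an assumption; the plan says the key is the content hash but does not fix a layout. SVG has no magic bytes and would need a separate text sniff, which this sketch leaves out.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sniffImageType returns a MIME type based on leading magic bytes, or ""
// if the content does not look like a supported raster image.
func sniffImageType(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")):
		return "image/png"
	case bytes.HasPrefix(b, []byte("GIF8")):
		return "image/gif"
	case bytes.HasPrefix(b, []byte{0x00, 0x00, 0x01, 0x00}):
		return "image/x-icon"
	case bytes.HasPrefix(b, []byte{0xFF, 0xD8, 0xFF}):
		return "image/jpeg"
	case len(b) >= 12 && string(b[0:4]) == "RIFF" && string(b[8:12]) == "WEBP":
		return "image/webp"
	case bytes.HasPrefix(b, []byte("BM")):
		return "image/bmp"
	}
	return ""
}

// contentKey returns the hex SHA-256 of the full content. Identical icons
// hash to the same key, which is what makes the upload dedup work.
func contentKey(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	body := []byte("GIF89a...")
	fmt.Println(sniffImageType(body), contentKey(body))
}
```

The two-byte BMP signature is weak, so decoding the header afterwards is still needed to reject text that happens to start with "BM".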
Dry-run test: --limit 200 --dry-run — prints what it would do for 200 icons. Check URLs, detected types, dimensions.
Done when: Can download, validate, and upload icons for a small batch.
Step 3.3: Full 100K Icon Run
Run against all icons in the database (likely 150K-300K icon rows for 100K hosts).
Monitor:
- icons/sec throughput
- Error breakdown (DNS failures, timeouts, HTTP errors, invalid images)
- S3 dedup hit rate
- Memory usage (adjust concurrency if needed)
Validation:
- SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state; (expect mostly completed, some failed)
- SELECT error, COUNT(*) FROM icons WHERE scan_state = 'failed' GROUP BY error ORDER BY count DESC LIMIT 20; (understand failure modes)
- aws s3 ls s3://everytab-icons/ | wc -l (confirm icons in S3)
- Spot-check: download a few icons from S3, open them, verify they're valid images
Stats: stats/03_icon_download.json
Done when: Icon download complete for 100K dev set, error rate understood, S3 populated.
Phase 4: Best Icon Selection & Bundle Generation (Stages 4-5)
Step 4.1: Best Icon Selection SQL
Write pipeline/04_best_icon/select.sql:
UPDATE hosts h SET best_icon_s3_key = sub.s3_key
FROM (
    SELECT DISTINCT ON (i.host_id) i.host_id, i.s3_key
    FROM icons i
    WHERE i.scan_state = 'completed'
    ORDER BY i.host_id,
        CASE
            WHEN i.width = i.height AND i.width IN (64, 48, 32, 16) THEN 0
            WHEN i.width = i.height AND i.width <= 64 THEN 1
            WHEN i.width IS NOT NULL AND i.width <= 64 AND i.height <= 64 THEN 2
            ELSE 3
        END,
        COALESCE(i.width, 0) DESC,
        CASE
            WHEN i.content_type IN ('image/png', 'image/gif', 'image/x-icon', 'image/vnd.microsoft.icon') THEN 0
            WHEN i.content_type = 'image/webp' THEN 1
            WHEN i.content_type = 'image/svg+xml' THEN 2
            ELSE 3
        END,
        i.file_size ASC
) sub
WHERE h.id = sub.host_id;
Validation:
- SELECT COUNT(*) FROM hosts WHERE best_icon_s3_key IS NOT NULL; (expect 60-80% of hosts)
- SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL AND best_icon_s3_key IS NULL; (hosts with title but no icon, which will still be in bundles)
- Spot-check: for a few hosts, verify the selected icon is reasonable (correct size, valid)
Stats: stats/04_best_icon.json
Done when: best_icon_s3_key populated for hosts that have valid icons.
Step 4.2: Bundle Generator Go Program
pipeline/05_bundle_gen/
├── main.go # Entry point, CLI flags
├── db.go # Query hosts + icon keys
├── convert.go # Icon format conversion → PNG
├── bundle.go # Chunk + serialize JSON
└── s3.go # Upload bundles to everytab-site
CLI flags:
- --db connection string
- --icons-bucket (default everytab-icons)
- --site-bucket (default everytab-site)
- --entries-per-bundle (tunable, start at 120)
- --dry-run (generate bundles to local disk, don't upload)
- --limit (only process N hosts, for testing)
Step 4.3: Icon Conversion Logic
Implement format conversion to PNG:
- Download icon from S3 by key
- Detect format from magic bytes
- Decode:
- PNG: decode directly
- ICO: parse container, extract image at recorded width/height, decode BMP or PNG within
- GIF/JPEG/BMP/WebP: decode to RGBA
- SVG: rasterize to 32x32 (use a Go SVG library, or shell out to rsvg-convert if simpler)
- Re-encode as PNG (optimized, don't upscale)
- Base64-encode
Test: Convert 50 icons of mixed formats manually, verify output PNGs look correct.
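The last two conversion steps (re-encode as PNG, then Base64) need only the standard library. A minimal sketch, assuming the bundle stores the bare Base64 string and the frontend adds the data:image/png;base64, prefix; the function names and the solid test image are illustrative.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"image"
	"image/color"
	"image/png"
)

// encodePNGBase64 re-encodes a decoded icon as PNG at maximum compression
// and returns the Base64 text.
func encodePNGBase64(img image.Image) (string, error) {
	var buf bytes.Buffer
	enc := png.Encoder{CompressionLevel: png.BestCompression}
	if err := enc.Encode(&buf, img); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// pngSizeFromBase64 decodes the Base64 text and reads back the PNG
// dimensions, which is a cheap way to verify a converted icon.
func pngSizeFromBase64(s string) (int, int, error) {
	raw, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return 0, 0, err
	}
	cfg, err := png.DecodeConfig(bytes.NewReader(raw))
	if err != nil {
		return 0, 0, err
	}
	return cfg.Width, cfg.Height, nil
}

// solid returns a single-color RGBA image, standing in for a decoded icon.
func solid(w, h int) image.Image {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			img.Set(x, y, color.RGBA{R: 0x33, G: 0x66, B: 0x99, A: 0xff})
		}
	}
	return img
}

func main() {
	s, err := encodePNGBase64(solid(16, 16))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	w, h, err := pngSizeFromBase64(s)
	fmt.Println(len(s) > 0, w, h, err)
}
```

The round trip through pngSizeFromBase64 could also feed the icon_w and icon_h fields of each bundle entry.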
Step 4.4: Bundle Assembly + Upload
Implement:
- Query all hosts WHERE html_title IS NOT NULL, randomize (ORDER BY random())
- For each host: fetch + convert its icon (or set empty string if no icon)
- Assemble entries into chunks of ENTRIES_PER_BUNDLE
- Serialize each chunk as JSON (tabs/{n}.json)
- Upload to S3 everytab-site/tabs/
- Record total bundle count
Dry-run: Generate bundles to local disk, inspect a few:
- Valid JSON
- Icons render in browser (paste a data:image/png;base64,... URI)
- Entries have host, title, icon, icon_w, icon_h, iframe_ok
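Chunking and serialization could be sketched as follows. The JSON field names come from the list above; everything else is an assumption, including the Go field types, the choice of a bare JSON array per bundle, and the chunk and bundleFiles helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entry is one tab in a bundle.
type entry struct {
	Host     string `json:"host"`
	Title    string `json:"title"`
	Icon     string `json:"icon"` // Base64 PNG, or "" when the host has no icon
	IconW    int    `json:"icon_w"`
	IconH    int    `json:"icon_h"`
	IframeOK bool   `json:"iframe_ok"`
}

// chunk splits entries into consecutive slices of at most size elements.
func chunk(entries []entry, size int) [][]entry {
	if size <= 0 {
		return nil
	}
	var out [][]entry
	for start := 0; start < len(entries); start += size {
		end := start + size
		if end > len(entries) {
			end = len(entries)
		}
		out = append(out, entries[start:end])
	}
	return out
}

// bundleFiles maps each bundle's object key (tabs/{n}.json) to its JSON body.
func bundleFiles(entries []entry, size int) (map[string][]byte, error) {
	files := make(map[string][]byte)
	for n, c := range chunk(entries, size) {
		b, err := json.Marshal(c)
		if err != nil {
			return nil, err
		}
		files[fmt.Sprintf("tabs/%d.json", n)] = b
	}
	return files, nil
}

func main() {
	entries := []entry{
		{Host: "example.com", Title: "Example", IframeOK: true},
		{Host: "example.org", Title: "Another"},
		{Host: "example.net", Title: "Third"},
	}
	files, err := bundleFiles(entries, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(files), string(files["tabs/1.json"]))
}
```

The number of files returned is the total bundle count that the frontend build later bakes in as TOTAL_BUNDLES. At 30M hosts the real generator would stream bundles to S3 rather than hold them all in a map.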
Validation:
- Bundle files exist in S3
- aws s3 ls s3://everytab-site/tabs/ | wc -l matches expected count
- Random bundle can be fetched and parsed as JSON
- Total hosts across all bundles = count of hosts with titles
Stats: stats/05_bundle_gen.json
Done when: All bundles uploaded to S3, JSON is valid, icons render.
Phase 5: Frontend (Stage 6)
This phase can begin in parallel with Phase 3-4 using mock bundle data.
Step 5.1: Mock Data for Frontend Dev
Generate 2-3 small mock bundle files (tabs/0.json, tabs/1.json, tabs/2.json) with ~20 entries each. Use real favicons (Google, GitHub, Wikipedia, etc.) manually base64-encoded. This lets us develop the frontend without waiting for the pipeline.
Serve locally with any static file server (python -m http.server).
Done when: Mock bundles exist and can be served locally.
Step 5.2: Basic Tab Rendering
Build frontend/index.html and frontend/site.js:
- HTML: minimal shell with a container div, inline CSS for tab styling
- JS: fetch a bundle, render tabs as rows filling the viewport
- Tab appearance: mimic Firefox tab shape (rounded top corners, slight border)
- Each tab shows favicon (16x16 or 32x32 img from data URI) + truncated title
- No-icon tabs show title only
Focus: get the visual density right. How many tabs fit across? How many rows fill the viewport? This determines ENTRIES_PER_BUNDLE.
Done when: Page renders tabs from a mock bundle. Visually looks like a page full of browser tabs.
Step 5.3: Marquee Animation
Add horizontal marquee to each row:
- CSS @keyframes animation, translateX
- Each row at slightly different speed and direction (some left, some right)
- Smooth, subtle movement — not distracting, just enough to feel alive
- Rows need extra tabs beyond viewport width to avoid gaps during scroll
Done when: Rows scroll smoothly, no visual glitches at edges.
Step 5.4: Interaction — Click, Iframe, Close
Implement tab click behavior:
- If iframe_ok: show an overlay with iframe loading the site ({protocol}://{hostname})
- If !iframe_ok: open in new tab (target="_blank", add rel="noopener")
- Visual indicator on tabs that will open externally (small icon/badge)
- Close overlay: X button + click-outside + Escape key
Done when: Clicking tabs works correctly for both iframe and external cases.
Step 5.5: Infinite Scroll + Random Bundle Loading
Implement:
- Seeded PRNG using Date.now() as the seed: generates a sequence of bundle indices that is deterministic for a given seed
- On page load: fetch first bundle, render
- Scroll detection: when user approaches bottom, fetch next random bundle
- Track loaded bundle IDs in a Set (no duplicates)
- Append new rows below existing ones
- Handle edge case: all bundles loaded (unlikely with 50K+ bundles but handle gracefully)
TOTAL_BUNDLES is a constant baked into the JS at build time.
Done when: Infinite scroll works, new bundles load seamlessly, no duplicate bundles.
Step 5.6: Frontend Build Script
Write pipeline/06_frontend/build.sh:
- Read total bundle count (from pipeline output or S3)
- Inject const TOTAL_BUNDLES = {M}; into site.js
- Copy index.html + site.js to S3 everytab-site/
- Invalidate CloudFront (if distribution exists)
Done when: Build script produces deployable frontend with correct bundle count.
Phase 6: Integration & End-to-End Test (100K)
Step 6.1: Run Full Pipeline (100K)
Execute all stages in sequence on EC2:
- Verify hosts table has 100K entries (from Phase 1)
- Run WARC parser (Phase 2) — should complete in minutes
- Run icon downloader (Phase 3) — should complete in 10-30 minutes at 100K scale
- Run best icon selection (Phase 4.1)
- Run bundle generator (Phase 4.2-4.4)
- Run frontend build (Phase 5.6)
Validation: Visit the CloudFront URL. The site should work:
- Tabs render with real favicons and titles
- Clicking works (iframe + external)
- Scrolling loads more tabs
- No JS console errors
Step 6.2: Tune Parameters
Based on the 100K run:
- ENTRIES_PER_BUNDLE: Look at the live site. Does one bundle fill the screen? Too many tabs? Too few? Adjust.
- Concurrency: Was the icon download memory-stable? CPU-bound or network-bound? Adjust goroutine pool size.
- Timeouts: What was the error distribution? Are timeouts too aggressive? Too lenient?
- Icon selection: Do the selected icons look good? Any weird sizes or broken images?
Update CLI flag defaults based on findings.
Step 6.3: Collect & Review Stats
Merge all stats/*.json into a single pipeline report. Review:
- Loss at each stage (domains → parsed → icons downloaded → icons selected → bundled)
- Time per stage
- Error patterns (are certain TLDs failing more? certain icon formats?)
- Storage usage (S3 icons bucket, S3 site bucket)
Identify any pipeline bugs or data quality issues. Fix before scaling up.
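The stage-to-stage loss in that review is simple arithmetic over host-level counts. A small sketch, assuming each stats file contributes one count per stage; the stageCount type and funnel function are illustrative and not part of the stats schema.

```go
package main

import "fmt"

// stageCount is the number of hosts that survived a pipeline stage.
type stageCount struct {
	Name  string
	Count int
}

// funnel formats each stage's count plus the percentage lost relative to
// the previous stage.
func funnel(stages []stageCount) []string {
	var lines []string
	for i, s := range stages {
		if i == 0 || stages[i-1].Count == 0 {
			lines = append(lines, fmt.Sprintf("%s: %d", s.Name, s.Count))
			continue
		}
		prev := stages[i-1].Count
		lost := 100 * float64(prev-s.Count) / float64(prev)
		lines = append(lines, fmt.Sprintf("%s: %d (%.1f%% lost)", s.Name, s.Count, lost))
	}
	return lines
}

func main() {
	// Placeholder numbers for illustration only, not measured results.
	stages := []stageCount{
		{Name: "domains", Count: 100000},
		{Name: "parsed", Count: 95000},
		{Name: "bundled", Count: 90000},
	}
	for _, line := range funnel(stages) {
		fmt.Println(line)
	}
}
```

A sudden jump in the loss percentage between two runs is usually a quicker bug signal than the absolute counts.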
Done when: End-to-end works at 100K, parameters tuned, stats reviewed, bugs fixed.
Phase 7: Full-Scale Run (30M)
Step 7.1: Remove Limits, Re-run CC-Index Query
Update the DuckDB query to remove LIMIT 100000. Re-run.
Considerations:
- If httpfs takes >1hr, switch to downloading the parquet files first
- May need to increase RDS storage (30M rows with WARC paths ≈ 5-10GB)
- Monitor DuckDB memory usage
Validation: SELECT COUNT(*) FROM hosts; shows ~30M rows.
Step 7.2: Run WARC Parser at Scale
Run with full concurrency against 30M hosts. Expected time: 2-6 hours.
Monitor:
- Throughput (hosts/sec)
- Error rate stability (should plateau, not climb)
- Postgres connection pool health
- Memory usage
Step 7.3: Run Icon Downloader at Scale
This is the long pole — expected 12-48 hours.
Monitor continuously:
- icons/sec rate
- DNS cache hit rate (check Unbound stats: unbound-control stats)
- S3 upload rate
- Error rate by type
- Completion percentage
If too slow (projected >48hrs):
- Consider increasing concurrency (if memory allows)
- Consider spinning up fleet (add more EC2 instances running the same binary)
- Check if DNS is the bottleneck (Unbound stats)
- Check if S3 uploads are the bottleneck (batch or reduce HEAD checks)
Step 7.4: Best Icon Selection + Bundle Generation
Run at full scale. Expected: 1-2 hours total.
Monitor bundle sizes — verify they're in the expected range with ENTRIES_PER_BUNDLE from tuning.
Step 7.5: Rebuild Frontend + Deploy
Run frontend build with the real bundle count. Invalidate CloudFront.
Validation: Visit the live site. Browse around. Check:
- Tab variety (seeing diverse sites, not just one TLD)
- Icon quality (no broken images, reasonable sizes)
- Performance (bundles load quickly, no jank)
- Stats page / stats.json looks correct
Done when: Full-scale site is live and working.
Phase 8: Backup & Teardown
Step 8.1: Backup RDS to Homelab
# On EC2 (fast connection to RDS):
pg_dump -Fc "$DATABASE_URL" > everytab_dump.pgfc
# Transfer to homelab (from EC2 or direct):
scp everytab_dump.pgfc homelab:/backups/everytab/
# On homelab, create the target database, then verify the restore:
createdb everytab_local
pg_restore -d everytab_local /backups/everytab/everytab_dump.pgfc
psql everytab_local -c "SELECT COUNT(*) FROM hosts;" -c "SELECT COUNT(*) FROM icons;"
Step 8.2: Backup Icons S3 to Homelab
# From homelab (or EC2 as intermediary):
aws s3 sync s3://everytab-icons/ /backups/everytab/icons/
# Verify file count matches:
ls /backups/everytab/icons/ | wc -l
# Compare with: aws s3 ls s3://everytab-icons/ | wc -l
Step 8.3: Verify & Teardown
After confirming backups:
# Verify the live site still works (it only depends on everytab-site + CloudFront)
curl -s https://your-cloudfront-domain.net/ | head
# Teardown scanning infrastructure:
aws rds delete-db-instance --db-instance-identifier everytab --skip-final-snapshot
aws s3 rb s3://everytab-icons --force
aws ec2 terminate-instances --instance-ids i-xxxxx
Done when: Only everytab-site S3 bucket + CloudFront remain running. Monthly cost: ~$2-4.
Development Notes
What Can Be Parallelized
- Frontend dev (Phase 5.1-5.5) can happen at any time using mock data
- AWS infra setup (Phase 0.2) can happen while writing code locally
- Icon downloader (Phase 3) and bundle generator (Phase 4) are independent codebases, can be written in parallel
Testing Strategy
- Dry-run flags on all Go programs: print what would happen without mutating DB/S3
- --limit flags on all Go programs: process a small subset quickly
- Spot-checks: after each stage, manually verify 5-10 random entries
- Stats files: compare counts between stages to catch data loss
- 100K dev set: full pipeline at small scale before committing to a 24hr+ full run
Common Pitfalls to Watch For
- DuckDB CC-Index path: The exact S3 path to parquet files changes per crawl. Check Common Crawl's website for the latest crawl ID and index location.
- WARC record format: WARC records have a specific envelope format (WARC/1.0 header, blank line, HTTP response). Don't assume the HTTP response starts at byte 0.
- Relative icon URLs: /favicon.ico is relative to root, but favicon.ico (no leading slash) is relative to the page path. Since we only have root pages (/), both resolve the same. But ../icons/fav.png could be tricky: handle gracefully or skip.
- ICO files are complex: The ICO container format can embed BMP (with a modified header) or PNG. Many "ICO" files are actually just PNGs renamed to .ico. Check magic bytes, not file extension.
- SVG rasterization: Go doesn't have great native SVG support. Consider shelling out to rsvg-convert (the CLI that ships with librsvg), or use a pure-Go rasterizer such as github.com/srwiley/oksvg. This can be a follow-up if SVG icons are rare.
- Postgres connection limits: RDS derives the default max_connections from instance memory (LEAST(DBInstanceClassMemory/9531392, 5000)), which on a db.t3.medium is a few hundred at most, well below 1000 goroutines. Run SHOW max_connections; to confirm, and rely on connection pooling (pgx pool handles this). Set pool max to ~40 connections.
- S3 consistency: S3 has provided strong read-after-write consistency since December 2020, so a HEAD issued after a successful PUT will find the object. The dedup check can still race when two workers handle identical content at the same time (both HEAD before either PUT completes). That is harmless: the key is the content hash, so the duplicate upload is idempotent.
- CloudFront caching: After deploying new bundles, invalidate /* or set short TTL during development. For production, use long TTLs (bundles are immutable between crawls).