Pipeline
Run these stages in order on the EC2 instance. Each stage is a single command.
Between stages, run the sanity checks to confirm data looks right before proceeding. All stages are idempotent — safe to re-run if interrupted.
Prerequisites
# Postgres on i3 instance (run infra/db-setup.sh on the i3 first)
export DATABASE_URL='postgres://everytab@<i3-private-ip>:5432/everytab'
# Go binaries built on EC2
go build -o ~/warc_parse ./everytab/pipeline/02_warc_parse/
go build -o ~/icon_download ./everytab/pipeline/03_icon_download/
go build -o ~/bundle_gen ./everytab/pipeline/05_bundle_gen/
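Before starting Stage 1 it can help to confirm that the prerequisites above are actually in place. The following is a minimal sketch, not part of the repository: `check_prereqs` is a hypothetical helper name, and the binary paths in the usage line are assumed from the `go build -o` targets above.

```shell
# Hypothetical helper: fail early if DATABASE_URL or a stage binary is missing.
check_prereqs() {
  local missing=0 bin
  if [ -z "${DATABASE_URL:-}" ]; then
    echo "DATABASE_URL is not set" >&2
    missing=1
  fi
  # Each argument is a path to a binary that must exist and be executable.
  for bin in "$@"; do
    if [ ! -x "$bin" ]; then
      echo "missing or non-executable binary: $bin" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (commented out so the sketch has no side effects when sourced):
# check_prereqs ~/warc_parse ~/icon_download ~/bundle_gen || exit 1
```

The function reports every missing item before returning, so one run shows everything that still needs fixing.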
Stage 1: CC-Index Query
Populates the hosts table from Common Crawl's columnar index.
./everytab/pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 100000
# Full run: --limit 0
Stage 2: WARC Parsing
Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers.
./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log --log-errors-only
Stage 3: Icon Download
Downloads favicons from the live web, validates, writes to local disk.
GOMEMLIMIT=12GiB ./icon_download --db "$DATABASE_URL" --log-file icon_download.log --icons-dir ~/icons --log-errors-only
Stage 4: Best Icon Selection
Picks the best icon per host for display.
psql "$DATABASE_URL" -f ./everytab/pipeline/04_best_icon/select.sql
Stage 5: Bundle Generation
Converts icons to PNG, assembles JSON bundles, uploads to S3.
./bundle_gen --db "$DATABASE_URL" --log-file bundle_gen.log --log-errors-only
Note the TOTAL_BUNDLES number from the summary — this gets baked into the frontend.
Stage 6: Frontend Deploy
From EC2, after bundle gen completes:
TOTAL_BUNDLES=$(jq -r '.bundles_created' stats/05_bundle_gen.json)
./everytab/pipeline/06_frontend/deploy.sh --total-bundles "$TOTAL_BUNDLES"
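`jq -r` prints the literal string `null` when a key is missing, so it may be worth validating the value before it is passed to the deploy script. This is a sketch under assumptions: `is_positive_int` is a hypothetical helper, and it assumes a successful run always creates at least one bundle.

```shell
# Hypothetical helper: succeed only for a non-empty, all-digit value greater than 0.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # empty, or contains a non-digit (e.g. "null")
  esac
  [ "$1" -gt 0 ]
}

# Usage (commented out; stats path and flag are the ones shown above):
# TOTAL_BUNDLES=$(jq -r '.bundles_created' stats/05_bundle_gen.json)
# is_positive_int "$TOTAL_BUNDLES" || { echo "bad TOTAL_BUNDLES: $TOTAL_BUNDLES" >&2; exit 1; }
```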
The deploy script:
- Injects TOTAL_BUNDLES into index.html
- Minifies site.js (via esbuild, strips comments + whitespace)
- Uploads frontend files to S3
- Deletes stale bundles from previous runs (numbers ≥ TOTAL_BUNDLES)
- Invalidates CloudFront cache
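The stale-bundle rule in the list above (numbers ≥ TOTAL_BUNDLES) implies zero-based bundle numbering. The sketch below only illustrates that rule. `stale_bundle_numbers` is a hypothetical helper, and how `deploy.sh` actually discovers the previous bundle count is not shown in this README, so the second argument is an assumption.

```shell
# Hypothetical helper: print bundle numbers left over from a larger previous run.
# $1 = new TOTAL_BUNDLES, $2 = bundle count of the previous run (assumed known).
stale_bundle_numbers() {
  local n=$1 prev_total=$2
  while [ "$n" -lt "$prev_total" ]; do
    echo "$n"
    n=$((n + 1))
  done
}

# Example: the previous run produced 6 bundles (0-5), the new run 3 (0-2).
# stale_bundle_numbers 3 6   # prints 3, 4, 5 on separate lines
```

If the new run is at least as large as the previous one, nothing is printed and there is nothing to delete.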
Stage 7: Backup to Homelab
After the site is deployed and verified, back up the data before tearing down the scanning infra.
What to back up:
| Data | Location on EC2 | Size estimate (30M) | Purpose |
|---|---|---|---|
| Database | pg_dump from i3 instance | ~5-10GB compressed | Full hosts + icons metadata, titles, WARC coordinates |
| Icons | ~/icons/ directory | ~500GB-1TB | Complete favicon archive, content-addressed by SHA-256 |
| Stats | ~/stats/*.json | <1MB | Pipeline timing and counts per stage |
| Logs | ~/*.log | varies | Error logs for debugging |
Backup commands:
Use -z for compression: fewer bytes on the wire means lower AWS egress costs ($0.09/GB). Icons are already in compressed formats (PNG/ICO), so savings are only ~5-10%, but on 500GB that is still $2-4.
# 1. Database dump (run on EC2; fast because it dumps to local disk first)
pg_dump -Fc "$DATABASE_URL" > ~/everytab_dump.pgfc
# 2. Transfer database to homelab
rsync -avPz ~/everytab_dump.pgfc homelab:/backups/everytab/
# 3. Transfer icons to homelab (this is the big one — 500GB+, will take hours)
# --ignore-existing skips icons already on homelab (for incremental monthly backups)
rsync -avPz --ignore-existing ~/icons/ homelab:/backups/everytab/icons/
# 4. Transfer stats and logs
rsync -avPz ~/stats/ homelab:/backups/everytab/stats/
rsync -avPz ~/*.log homelab:/backups/everytab/logs/
Verify on homelab:
# Check database restores (pg_restore -d needs an existing target database)
createdb everytab_local
pg_restore -d everytab_local /backups/everytab/everytab_dump.pgfc
psql everytab_local -c "SELECT COUNT(*) FROM hosts;" -c "SELECT COUNT(*) FROM icons;"
# Check icon count matches
find /backups/everytab/icons/ -type f | wc -l
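To turn the icon count into a pass/fail check, the same `find` can be compared against a count taken on EC2. A minimal sketch: `count_files` and `counts_match` are hypothetical helper names, and the `ec2` SSH host alias in the usage line is an assumption.

```shell
# Hypothetical helpers: count regular files under a directory and compare
# the result with an expected number.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

counts_match() {
  local actual
  actual=$(count_files "$1")
  if [ "$actual" -ne "$2" ]; then
    echo "count mismatch in $1: expected $2, found $actual" >&2
    return 1
  fi
}

# Usage (commented out; "ec2" is an assumed SSH host alias):
# expected=$(ssh ec2 'find ~/icons/ -type f | wc -l')
# counts_match /backups/everytab/icons/ "$expected"
```

Counting lines from `find` assumes no filename contains a newline, which holds for files named by SHA-256 hash.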
Then tear down:
# From local machine
cd infra && terraform apply -var="scanning=false"
Note: The icons directory is the largest transfer. At home internet speeds (~300Mbps) transferring 500GB takes ~3-4 hours. Consider running rsync in tmux. If the transfer fails partway, rsync resumes where it left off.
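Because re-running rsync picks up where the previous attempt stopped, a small retry loop inside tmux can ride out dropped connections unattended. This is a sketch, not something the repository provides: `retry` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between failed attempts. Returns 0 on the first success, 1 if all fail.
retry() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage (commented out; run inside tmux):
# retry 5 60 rsync -avPz --ignore-existing ~/icons/ homelab:/backups/everytab/icons/
```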
Data transfer costs: AWS charges for outbound data. 500GB outbound = ~$45. For the 3M dev run (~50GB) it's ~$5. An EBS snapshot would keep a copy inside AWS, but moving that data to the homelab still incurs the same egress charge, so rsync is the simpler option.
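The transfer time and egress figures quoted in these notes can be reproduced for other dataset sizes with a little arithmetic. A sketch under assumptions: the helper names are hypothetical, the rate defaults to the $0.09/GB quoted above, and sizes are treated as decimal gigabytes (1 GB = 8000 megabits).

```shell
# Hypothetical helpers for back-of-the-envelope estimates.

# egress_cost_usd GB [RATE_PER_GB]
egress_cost_usd() {
  LC_ALL=C awk -v gb="$1" -v rate="${2:-0.09}" 'BEGIN { printf "%.2f\n", gb * rate }'
}

# transfer_hours GB MBPS
transfer_hours() {
  LC_ALL=C awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.1f\n", gb * 8000 / mbps / 3600 }'
}

# egress_cost_usd 500     # prints 45.00
# egress_cost_usd 50      # prints 4.50
# transfer_hours 500 300  # prints 3.7
```

These are lower bounds for time, since they assume the link stays saturated for the whole transfer.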
Full Reset (start over)
psql "$DATABASE_URL" -c "TRUNCATE hosts CASCADE;"
# Then run from Stage 1
Future: Automated Monthly Run
For automating with new Common Crawl releases, wrap stages 1-6 in a script that:
- Truncates old data
- Runs each stage, checking exit codes
- Compares stats against previous run (alert if coverage drops significantly)
- Deploys only if all stages succeed
- Sends notification on completion or failure
The pipeline is already designed for this — each stage writes stats JSON, uses CLI flags for configuration, and is resumable on failure.
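The wrapper described above could be built from two small pieces: one that runs a stage and stops on a non-zero exit code, and one that flags a coverage drop between runs. This is only a sketch. `run_stage` and `coverage_dropped` are hypothetical names, the 10% threshold is an arbitrary assumption, and which stats field represents coverage is left open.

```shell
# Hypothetical helper: run one stage and report failure by name.
run_stage() {
  local name=$1
  shift
  echo "== $name =="
  if ! "$@"; then
    echo "stage failed: $name" >&2
    return 1
  fi
}

# Hypothetical helper: succeed (meaning "alert") when the current count is
# more than max_drop_pct percent below the previous count. Integer math only.
coverage_dropped() {
  local prev=$1 cur=$2 max_drop_pct=$3
  [ $((cur * 100)) -lt $((prev * (100 - max_drop_pct))) ]
}

# Usage (commented out; commands are the ones from the stages above):
# run_stage "01 cc_index" ./everytab/pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 0 &&
# run_stage "02 warc_parse" ./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log --log-errors-only &&
# run_stage "04 best_icon" psql "$DATABASE_URL" -f ./everytab/pipeline/04_best_icon/select.sql
# if coverage_dropped "$prev_count" "$cur_count" 10; then echo "coverage dropped" >&2; fi
```

Chaining the stages with `&&` means a deploy step placed last only runs if every earlier stage succeeded.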