
Joe Lothan 758ab3080b interleve no icon hosts and icons hosts for an even mix		2026-05-26 23:50:02 -04:00
01_cc_index	order by downloaded time to improve ebs read performance	2026-05-26 23:10:53 -04:00
02_warc_parse	updated number of async writers to warc_parse to accomidate faster db nvme write speeds	2026-05-25 19:41:25 -04:00
03_icon_download	removed shuffling of hosts to keep hostid for better bundling	2026-05-26 01:42:07 -04:00
04_best_icon	order by downloaded time to improve ebs read performance	2026-05-26 23:10:53 -04:00
05_bundle_gen	interleve no icon hosts and icons hosts for an even mix	2026-05-26 23:50:02 -04:00
06_frontend	bumped padding of bundles to 6 digits	2026-05-25 23:56:13 -04:00
README.md	deploy frontend from the ec2 at the end of the pipeline	2026-05-25 23:21:50 -04:00
run.sh	add one run.sh for the entire pipeline	2026-05-26 02:09:57 -04:00

README.md

Pipeline

Run these stages in order on the EC2 instance. Each stage is a single command.

Between stages, run the sanity checks to confirm data looks right before proceeding. All stages are idempotent — safe to re-run if interrupted.
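
The sanity checks themselves are not spelled out in this README. As a minimal sketch, a check can be as simple as asserting a row count after a stage finishes. `check_count` is a hypothetical helper, not something the repo provides, and the minimum of 1 is a placeholder to tune per stage.

```shell
# Hypothetical helper (not part of the repo): fail loudly when a row count
# is below the minimum expected for the stage that just finished.
check_count() {
  label=$1
  count=$2
  min=$3
  if [ "$count" -ge "$min" ]; then
    echo "ok: $label = $count"
  else
    echo "FAIL: $label = $count (expected >= $min)" >&2
    return 1
  fi
}

# Example, assuming psql is installed and DATABASE_URL is exported:
#   check_count hosts "$(psql "$DATABASE_URL" -Atc 'SELECT COUNT(*) FROM hosts')" 1
```

A non-zero return makes the check easy to chain in front of the next stage with `&&`.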

Prerequisites

# Postgres on i3 instance (run infra/db-setup.sh on the i3 first)
export DATABASE_URL='postgres://everytab@<i3-private-ip>:5432/everytab'

# Go binaries built on EC2
go build -o ~/warc_parse ./everytab/pipeline/02_warc_parse/
go build -o ~/icon_download ./everytab/pipeline/03_icon_download/
go build -o ~/bundle_gen ./everytab/pipeline/05_bundle_gen/
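
Before starting Stage 1 it may help to confirm the prerequisites are actually in place. The sketch below is an assumption about what such a preflight could look like; `require_cmd` and `preflight` are hypothetical names, and the tool list is taken from the commands used elsewhere in this README.

```shell
# Hypothetical preflight check (not part of the repo).

# Succeeds if the named command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing command: $1" >&2; return 1; }
}

# Reports every missing prerequisite, then returns non-zero if any was missing.
preflight() {
  status=0
  [ -n "${DATABASE_URL:-}" ] || { echo "DATABASE_URL is not set" >&2; status=1; }
  for cmd in psql go jq rsync; do
    require_cmd "$cmd" || status=1
  done
  for bin in "$HOME/warc_parse" "$HOME/icon_download" "$HOME/bundle_gen"; do
    [ -x "$bin" ] || { echo "binary not built: $bin" >&2; status=1; }
  done
  return "$status"
}

# Usage: preflight && echo "ready to run Stage 1"
```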

Stage 1: CC-Index Query

Populates the hosts table from Common Crawl's columnar index.

./everytab/pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 100000
# Full run: --limit 0

Stage 2: WARC Parsing

Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers.

./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log --log-errors-only

Stage 3: Icon Download

Downloads favicons from the live web, validates, writes to local disk.

GOMEMLIMIT=12GiB ./icon_download --db "$DATABASE_URL" --log-file icon_download.log --icons-dir ~/icons --log-errors-only

Stage 4: Best Icon Selection

Picks the best icon per host for display.

psql "$DATABASE_URL" -f ./everytab/pipeline/04_best_icon/select.sql

Stage 5: Bundle Generation

Converts icons to PNG, assembles JSON bundles, uploads to S3.

./bundle_gen --db "$DATABASE_URL" --log-file bundle_gen.log --log-errors-only

Note the TOTAL_BUNDLES number from the summary — this gets baked into the frontend.

Stage 6: Frontend Deploy

From EC2, after bundle gen completes:

TOTAL_BUNDLES=$(jq -r '.bundles_created' stats/05_bundle_gen.json)
./everytab/pipeline/06_frontend/deploy.sh --total-bundles "$TOTAL_BUNDLES"

The deploy script:

Injects TOTAL_BUNDLES into index.html
Minifies site.js (via esbuild, strips comments + whitespace)
Uploads frontend files to S3
Deletes stale bundles from previous runs (numbers ≥ TOTAL_BUNDLES)
Invalidates CloudFront cache
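
To illustrate the stale-bundle rule, here is a minimal sketch of how a bundle index maps to a name and how staleness could be decided. It assumes bundles are numbered from 0 and stored under 6-digit zero-padded names such as `000042.json`; the `.json` suffix and both function names are assumptions, not taken from `deploy.sh`.

```shell
# Assumption: bundle N is stored as a 6-digit zero-padded name, e.g. 000042.json.
bundle_name() {
  printf '%06d.json\n' "$1"
}

# Succeeds when the bundle file name refers to an index >= TOTAL_BUNDLES ($2).
is_stale() {
  name=${1%.json}
  # Strip leading zeros so the comparison is plainly decimal.
  index=$(printf '%s' "$name" | sed 's/^0*//')
  [ "${index:-0}" -ge "$2" ]
}

# Example: with TOTAL_BUNDLES=100, 000100.json is stale and 000099.json is kept.
```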

Stage 7: Backup to Homelab

After the site is deployed and verified, back up the data before tearing down the scanning infra.

What to back up:

Data	Location on EC2	Size estimate (30M run)	Purpose
Database	pg_dump from i3 instance	~5-10GB compressed	Full hosts + icons metadata, titles, WARC coordinates
Icons	`~/icons/` directory	~500GB-1TB	Complete favicon archive, content-addressed by SHA-256
Stats	`~/stats/*.json`	<1MB	Pipeline timing and counts per stage
Logs	`~/*.log`	varies	Error logs for debugging

Backup commands:

Use -z for compression: fewer bytes on the wire means lower AWS egress costs ($0.09/GB). Icons are already in compressed formats (PNG/ICO), so savings are only ~5-10%, but on 500GB that's still about $2-4.

# 1. Database dump (run on EC2, fast — dumps to local disk first)
pg_dump -Fc "$DATABASE_URL" > ~/everytab_dump.pgfc

# 2. Transfer database to homelab
rsync -avPz ~/everytab_dump.pgfc homelab:/backups/everytab/

# 3. Transfer icons to homelab (this is the big one — 500GB+, will take hours)
#    --ignore-existing skips icons already on homelab (for incremental monthly backups)
rsync -avPz --ignore-existing ~/icons/ homelab:/backups/everytab/icons/

# 4. Transfer stats and logs
rsync -avPz ~/stats/ homelab:/backups/everytab/stats/
rsync -avPz ~/*.log homelab:/backups/everytab/logs/

Verify on homelab:

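# Check database restores (pg_restore -d needs an existing database;
# skip createdb if everytab_local already exists)
createdb everytab_local
pg_restore -d everytab_local /backups/everytab/everytab_dump.pgfc
# One -c per query so both counts are printed on every psql version
psql everytab_local -c "SELECT COUNT(*) FROM hosts;" -c "SELECT COUNT(*) FROM icons;"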

# Check icon count matches the count on EC2 (find ~/icons/ -type f | wc -l)
find /backups/everytab/icons/ -type f | wc -l
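
A file count only shows that nothing is missing. Because the icons are content-addressed by SHA-256, the contents can be checked too. The sketch below assumes each file name (minus any extension) is the lowercase hex digest of the file's contents, which this README does not confirm, and that GNU `sha256sum` is available. `verify_icon` and `verify_icons` are hypothetical names.

```shell
# Assumption: an icon's file name, minus any extension, is the lowercase hex
# SHA-256 digest of its contents.
verify_icon() {
  base=$(basename "$1")
  expected=${base%%.*}
  actual=$(sha256sum "$1" | cut -d' ' -f1)
  [ "$expected" = "$actual" ]
}

# Prints every mismatching path to stderr; returns non-zero if any was found.
verify_icons() {
  mismatches=$(find "$1" -type f | while IFS= read -r path; do
    verify_icon "$path" || printf '%s\n' "$path"
  done)
  if [ -n "$mismatches" ]; then
    printf '%s\n' "$mismatches" >&2
    return 1
  fi
}

# Usage: verify_icons /backups/everytab/icons/
```

Hashing 500GB+ takes a while, so this is the kind of check to leave running in tmux.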

Then tear down:

# From local machine
cd infra && terraform apply -var="scanning=false"

Note: The icons directory is the largest transfer. At home internet speeds (~300Mbps), transferring 500GB takes ~3-4 hours. Consider running rsync inside tmux so it survives an SSH disconnect. If the transfer fails partway, re-run the same command and rsync will skip the files that already transferred.

Data transfer costs: AWS charges for outbound data. 500GB outbound = ~$45. For the 3M dev run (~50GB) it's ~$5. To avoid this, you could attach an EBS volume, snapshot it, and restore it in your homelab's AWS account — but rsync is simpler.
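
The time and cost figures above follow from simple arithmetic, sketched here so they can be recomputed for other sizes. The $0.09/GB rate and the 300Mbps link speed are the numbers quoted in this README; the function names are illustrative, and units are decimal (1 GB = 8000 Mb).

```shell
# Egress cost in USD for a transfer of $1 GB, at $0.09/GB.
egress_cost_usd() {
  awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * 0.09 }'
}

# Hours needed to move $1 GB over a link of $2 Mbps, ignoring protocol overhead.
transfer_hours() {
  awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.1f\n", gb * 8000 / mbps / 3600 }'
}

egress_cost_usd 500     # prints 45.00
egress_cost_usd 50      # prints 4.50
transfer_hours 500 300  # prints 3.7
```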

Full Reset (start over)

psql "$DATABASE_URL" -c "TRUNCATE hosts CASCADE;"
# Then run from Stage 1

Future: Automated Monthly Run

For automating with new Common Crawl releases, wrap stages 1-6 in a script that:

Truncates old data
Runs each stage, checking exit codes
Compares stats against previous run (alert if coverage drops significantly)
Deploys only if all stages succeed
Sends notification on completion or failure

The pipeline is already designed for this — each stage writes stats JSON, uses CLI flags for configuration, and is resumable on failure.
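
As a rough sketch of such a wrapper, the pieces below chain the documented stage commands, stop at the first failure, and gate the deploy on a comparison with the previous run. Everything not shown earlier in this README is an assumption: the helper names, the `stats.prev/` directory holding last month's stats, and the 90% threshold. Notification is left out.

```shell
# Hypothetical wrapper pieces (not part of the repo).

# Run one stage; report and fail on a non-zero exit code.
run_stage() {
  name=$1
  shift
  echo "== $name"
  "$@" || { echo "stage failed: $name" >&2; return 1; }
}

# Succeeds when the current count ($2) is below $3 percent of the previous count ($1).
coverage_dropped() {
  [ $(( $2 * 100 )) -lt $(( $1 * $3 )) ]
}

# Sketch of the monthly driver; deploy runs only if every earlier stage succeeded.
monthly_run() {
  run_stage truncate   psql "$DATABASE_URL" -c "TRUNCATE hosts CASCADE;" &&
  run_stage cc_index   ./everytab/pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 0 &&
  run_stage warc_parse ./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log --log-errors-only &&
  run_stage icons      env GOMEMLIMIT=12GiB ./icon_download --db "$DATABASE_URL" \
                         --log-file icon_download.log --icons-dir "$HOME/icons" --log-errors-only &&
  # ON_ERROR_STOP makes psql -f exit non-zero when a statement in the file fails.
  run_stage best_icon  psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f ./everytab/pipeline/04_best_icon/select.sql &&
  run_stage bundle_gen ./bundle_gen --db "$DATABASE_URL" --log-file bundle_gen.log --log-errors-only ||
    return 1

  total=$(jq -r '.bundles_created' stats/05_bundle_gen.json)
  # Assumption: last month's stats were copied to stats.prev/ before this run.
  previous=$(jq -r '.bundles_created' stats.prev/05_bundle_gen.json)
  if coverage_dropped "$previous" "$total" 90; then
    echo "bundle count fell below 90% of the previous run; not deploying" >&2
    return 1
  fi
  run_stage deploy ./everytab/pipeline/06_frontend/deploy.sh --total-bundles "$total"
}
```

Integer arithmetic keeps the threshold check in plain POSIX shell; a real wrapper would likely compare more than one statistic.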