Joe Lothan cf17fc42b1 fixed icon downloading performance issues		2026-05-19 10:32:34 -04:00
01_cc_index	added query.sh to read the cc-index from s3 parquet files and dump it into our psql db	2026-05-17 19:12:25 -04:00
02_warc_parse	added warc parser	2026-05-17 20:25:59 -04:00
03_icon_download	fixed icon downloading performance issues	2026-05-19 10:32:34 -04:00
04_best_icon	don't allow 1 pixel favicons	2026-05-17 23:01:53 -04:00
05_bundle_gen	switched from s3 to disk for saving icons	2026-05-18 12:43:50 -04:00
06_frontend	fix TOTAL_BUNDLES sed command in deploy script	2026-05-18 01:00:09 -04:00
README.md	added initial pipeline README	2026-05-18 00:40:57 -04:00

README.md

Pipeline

Run these stages in order on the EC2 instance. Stages 1-5 are each a single command; Stage 6 is a short sequence of deploy commands.

Between stages, check the validation queries to confirm data looks right before proceeding. All stages are idempotent — safe to re-run if interrupted.
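A check between stages can also be turned into a hard gate. The sketch below is one possible helper, not part of the pipeline: `require_min` is a hypothetical function name, and the minimum of 1 in the usage comment is an assumed threshold that should be adjusted to the expected data volume.

```shell
# require_min LABEL VALUE MIN
# Succeeds only if VALUE is a non-negative integer that is >= MIN.
# (Hypothetical helper; the pipeline itself does not ship this function.)
require_min() {
  local label="$1" value="$2" min="$3"
  case "$value" in
    ''|*[!0-9]*)
      echo "FAIL $label: non-numeric result '$value'" >&2
      return 1
      ;;
  esac
  if [ "$value" -lt "$min" ]; then
    echo "FAIL $label: $value < $min" >&2
    return 1
  fi
  echo "OK $label: $value"
}

# Usage against a live database (-A = unaligned, -t = tuples only,
# so psql prints just the number):
#   parsed=$(psql "$DATABASE_URL" -At -c "SELECT COUNT(*) FROM hosts WHERE parsed = TRUE;")
#   require_min parsed_hosts "$parsed" 1 || exit 1
```

Keeping the comparison in a small function separates the database call from the pass/fail decision, so the same gate can be reused after every stage.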

Prerequisites

# Database URL in environment
export DATABASE_URL='postgres://everytab:PASS@RDS_ENDPOINT:5432/everytab'

# Schema created
psql "$DATABASE_URL" -f pipeline/01_cc_index/schema.sql

# Go binaries built and on the EC2 instance
# (build locally with CGO_ENABLED=0 GOOS=linux GOARCH=amd64, then scp)
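The build step could look roughly like the sketch below. It assumes that each Go binary's main package lives under `pipeline/<stage directory>` and that output goes to a local `bin/` directory; neither is confirmed by this README. `run`, `build_stage`, and `build_all` are hypothetical helper names. Setting `DRY_RUN=1` prints the commands instead of running them.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# build_stage DIR NAME
# Cross-compile one stage as a static Linux/amd64 binary.
# ASSUMPTION: the main package is at ./pipeline/DIR.
build_stage() {
  local dir="$1" name="$2"
  run env CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
    go build -o "bin/$name" "./pipeline/$dir"
}

# Build the three Go stages named in this README, stopping on the first error.
build_all() {
  build_stage 02_warc_parse warc_parse &&
  build_stage 03_icon_download icon_download &&
  build_stage 05_bundle_gen bundle_gen
}

# Afterwards, copy the binaries over (USER and EC2_HOST are placeholders):
#   scp bin/warc_parse bin/icon_download bin/bundle_gen USER@EC2_HOST:~/
```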

Stage 1: CC-Index Query

Populates the hosts table from Common Crawl's columnar index.

./pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 100000
# Full run: --limit 0

Check: psql "$DATABASE_URL" -c "SELECT COUNT(*), protocol FROM hosts GROUP BY protocol;"

Stage 2: WARC Parsing

Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers.

./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log

Check: psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM hosts WHERE parsed = TRUE;" -c "SELECT COUNT(*) FROM icons;"

Stage 3: Icon Download

Downloads favicons from the live web, validates, uploads to S3.

./icon_download --db "$DATABASE_URL" --log-file icon_download.log --log-errors-only

Check: psql "$DATABASE_URL" -c "SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;"

Stage 4: Best Icon Selection

Picks the best icon per host for display.

psql "$DATABASE_URL" -f pipeline/04_best_icon/select.sql

Stage 5: Bundle Generation

Converts icons to PNG, assembles JSON bundles, uploads to S3.

./bundle_gen --db "$DATABASE_URL" --log-file bundle_gen.log --log-errors-only

Check: aws s3 ls s3://everytab-site/tabs/ | wc -l

Note the TOTAL_BUNDLES number from the summary — this gets baked into the frontend.

Stage 6: Frontend Deploy

# Update TOTAL_BUNDLES in index.html (replace NUM with the value noted in Stage 5)
sed -i "s/const TOTAL_BUNDLES = .*/const TOTAL_BUNDLES = NUM;/" frontend/index.html

# Upload frontend files
aws s3 cp frontend/index.html s3://everytab-site/
aws s3 cp frontend/site.js s3://everytab-site/
aws s3 cp frontend/bot.html s3://everytab-site/

# Invalidate CloudFront cache (if configured)
# aws cloudfront create-invalidation --distribution-id DIST_ID --paths "/*"
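Because the `sed` command runs without complaint even if the `NUM` placeholder is left in, a small wrapper can guard the substitution. The following is a sketch only: `set_total_bundles` is a hypothetical helper, not something the deploy script is known to contain. It rejects non-numeric values and fails if the target line is missing.

```shell
# set_total_bundles FILE NUM
# Rewrite the "const TOTAL_BUNDLES = ...;" line in FILE to use NUM.
# (Hypothetical helper wrapping the sed command from this stage.)
set_total_bundles() {
  local file="$1" num="$2"
  case "$num" in
    ''|*[!0-9]*)
      echo "TOTAL_BUNDLES must be a non-negative integer, got '$num'" >&2
      return 1
      ;;
  esac
  if ! grep -q 'const TOTAL_BUNDLES = ' "$file"; then
    echo "no TOTAL_BUNDLES line found in $file" >&2
    return 1
  fi
  # -i.bak is accepted by both GNU and BSD sed; drop the backup on success.
  sed -i.bak "s/const TOTAL_BUNDLES = .*/const TOTAL_BUNDLES = ${num};/" "$file" &&
    rm -f "$file.bak"
}

# Usage:
#   set_total_bundles frontend/index.html 1234
```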

Full Reset (start over)

psql "$DATABASE_URL" -c "TRUNCATE hosts CASCADE;"
# Then run from Stage 1

Future: Automated Monthly Run

For automating with new Common Crawl releases, wrap stages 1-6 in a script that:

1. Truncates old data
2. Runs each stage, checking exit codes
3. Compares stats against previous run (alert if coverage drops significantly)
4. Deploys only if all stages succeed
5. Sends notification on completion or failure

The pipeline is already designed for this — each stage writes stats JSON, uses CLI flags for configuration, and is resumable on failure.
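The core of such a wrapper might be sketched as follows. `run_stages` and `coverage_ok` are hypothetical helpers. The format of the stats JSON is not described here, so the sketch assumes the previous and current coverage counts have already been extracted as plain integers, and the 10 percent threshold in the usage comment is an arbitrary example.

```shell
# run_stages CMD...
# Run each command string in order; stop and return 1 at the first failure.
run_stages() {
  local cmd
  for cmd in "$@"; do
    echo "==> $cmd"
    if ! sh -c "$cmd"; then
      echo "stage failed: $cmd" >&2
      return 1
    fi
  done
}

# coverage_ok PREV CUR MAX_DROP_PCT
# Succeeds unless CUR is more than MAX_DROP_PCT percent below PREV.
coverage_ok() {
  local prev="$1" cur="$2" max_drop="$3"
  [ "$prev" -gt 0 ] || return 0   # no baseline yet, nothing to compare
  # Integer form of: cur / prev >= (100 - max_drop) / 100
  [ $(( cur * 100 )) -ge $(( prev * (100 - max_drop) )) ]
}

# Usage (DATABASE_URL must be exported; counts come from the stats output):
#   run_stages \
#     './pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 0' \
#     './warc_parse --db "$DATABASE_URL" --log-file warc_parse.log' \
#     || exit 1
#   coverage_ok "$prev_count" "$cur_count" 10 || echo "coverage dropped" >&2
```

Passing stages as command strings keeps the wrapper independent of each stage's flags, and deployment can be made conditional simply by placing it after both checks.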