# Pipeline Run these stages in order on the EC2 instance. Each stage is a single command. Between stages, check the validation queries to confirm data looks right before proceeding. All stages are idempotent — safe to re-run if interrupted. ## Prerequisites ```bash # Database URL in environment export DATABASE_URL='postgres://everytab:PASS@RDS_ENDPOINT:5432/everytab' # Schema created psql $DATABASE_URL -f pipeline/01_cc_index/schema.sql # Go binaries built and on the EC2 instance # (build locally with CGO_ENABLED=0 GOOS=linux GOARCH=amd64, then scp) ``` ## Stage 1: CC-Index Query Populates the `hosts` table from Common Crawl's columnar index. ```bash ./pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 100000 # Full run: --limit 0 ``` **Check:** `psql $DATABASE_URL -c "SELECT COUNT(*), protocol FROM hosts GROUP BY protocol;"` ## Stage 2: WARC Parsing Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers. ```bash ./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log ``` **Check:** `psql $DATABASE_URL -c "SELECT COUNT(*) FROM hosts WHERE parsed = TRUE; SELECT COUNT(*) FROM icons;"` ## Stage 3: Icon Download Downloads favicons from the live web, validates, uploads to S3. ```bash ./icon_download --db "$DATABASE_URL" --log-file icon_download.log --log-errors-only ``` **Check:** `psql $DATABASE_URL -c "SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;"` ## Stage 4: Best Icon Selection Picks the best icon per host for display. ```bash psql $DATABASE_URL -f pipeline/04_best_icon/select.sql ``` ## Stage 5: Bundle Generation Converts icons to PNG, assembles JSON bundles, uploads to S3. ```bash ./bundle_gen --db "$DATABASE_URL" --log-file bundle_gen.log --log-errors-only ``` **Check:** `aws s3 ls s3://everytab-site/tabs/ | wc -l` Note the `TOTAL_BUNDLES` number from the summary — this gets baked into the frontend. ## Stage 6: Frontend Deploy ```bash # Update TOTAL_BUNDLES in index.html sed -i "s/const TOTAL_BUNDLES = .*/const TOTAL_BUNDLES = NUM;/" frontend/index.html # Upload frontend files aws s3 cp frontend/index.html s3://everytab-site/ aws s3 cp frontend/site.js s3://everytab-site/ aws s3 cp frontend/bot.html s3://everytab-site/ # Invalidate CloudFront cache (if configured) # aws cloudfront create-invalidation --distribution-id DIST_ID --paths "/*" ``` ## Full Reset (start over) ```bash psql $DATABASE_URL -c "TRUNCATE hosts CASCADE;" # Then run from Stage 1 ``` ## Future: Automated Monthly Run For automating with new Common Crawl releases, wrap stages 1-6 in a script that: 1. Truncates old data 2. Runs each stage, checking exit codes 3. Compares stats against previous run (alert if coverage drops significantly) 4. Deploys only if all stages succeed 5. Sends notification on completion or failure The pipeline is already designed for this — each stage writes stats JSON, uses CLI flags for configuration, and is resumable on failure.