From a977a8c0b3db15db48488ab72a78b0516f00ab8c Mon Sep 17 00:00:00 2001 From: Joe Lothan Date: Mon, 18 May 2026 00:40:57 -0400 Subject: [PATCH] added initial pipeline README --- pipeline/README.md | 102 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 102 insertions(+) create mode 100644 pipeline/README.md diff --git a/pipeline/README.md b/pipeline/README.md new file mode 100644 index 0000000..4582d9e --- /dev/null +++ b/pipeline/README.md @@ -0,0 +1,102 @@ +# Pipeline + +Run these stages in order on the EC2 instance. Each stage is a single command. + +Between stages, check the validation queries to confirm data looks right before proceeding. All stages are idempotent — safe to re-run if interrupted. + +## Prerequisites + +```bash +# Database URL in environment +export DATABASE_URL='postgres://everytab:PASS@RDS_ENDPOINT:5432/everytab' + +# Schema created +psql $DATABASE_URL -f pipeline/01_cc_index/schema.sql + +# Go binaries built and on the EC2 instance +# (build locally with CGO_ENABLED=0 GOOS=linux GOARCH=amd64, then scp) +``` + +## Stage 1: CC-Index Query + +Populates the `hosts` table from Common Crawl's columnar index. + +```bash +./pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 100000 +# Full run: --limit 0 +``` + +**Check:** `psql $DATABASE_URL -c "SELECT COUNT(*), protocol FROM hosts GROUP BY protocol;"` + +## Stage 2: WARC Parsing + +Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers. + +```bash +./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log +``` + +**Check:** `psql $DATABASE_URL -c "SELECT COUNT(*) FROM hosts WHERE parsed = TRUE; SELECT COUNT(*) FROM icons;"` + +## Stage 3: Icon Download + +Downloads favicons from the live web, validates, uploads to S3. + +```bash +./icon_download --db "$DATABASE_URL" --log-file icon_download.log --log-errors-only +``` + +**Check:** `psql $DATABASE_URL -c "SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;"` + +## Stage 4: Best Icon Selection + +Picks the best icon per host for display. + +```bash +psql $DATABASE_URL -f pipeline/04_best_icon/select.sql +``` + +## Stage 5: Bundle Generation + +Converts icons to PNG, assembles JSON bundles, uploads to S3. + +```bash +./bundle_gen --db "$DATABASE_URL" --log-file bundle_gen.log --log-errors-only +``` + +**Check:** `aws s3 ls s3://everytab-site/tabs/ | wc -l` + +Note the `TOTAL_BUNDLES` number from the summary — this gets baked into the frontend. + +## Stage 6: Frontend Deploy + +```bash +# Update TOTAL_BUNDLES in index.html +sed -i "s/const TOTAL_BUNDLES = .*/const TOTAL_BUNDLES = NUM;/" frontend/index.html + +# Upload frontend files +aws s3 cp frontend/index.html s3://everytab-site/ +aws s3 cp frontend/site.js s3://everytab-site/ +aws s3 cp frontend/bot.html s3://everytab-site/ + +# Invalidate CloudFront cache (if configured) +# aws cloudfront create-invalidation --distribution-id DIST_ID --paths "/*" +``` + +## Full Reset (start over) + +```bash +psql $DATABASE_URL -c "TRUNCATE hosts CASCADE;" +# Then run from Stage 1 +``` + +## Future: Automated Monthly Run + +For automating with new Common Crawl releases, wrap stages 1-6 in a script that: +1. Truncates old data +2. Runs each stage, checking exit codes +3. Compares stats against previous run (alert if coverage drops significantly) +4. Deploys only if all stages succeed +5. Sends notification on completion or failure + +The pipeline is already designed for this — each stage writes stats JSON, uses CLI flags for configuration, and is resumable on failure.