everytab/pipeline/README.md

# Pipeline

Run these stages in order on the EC2 instance. Each stage is a single command.

Between stages, check the validation queries to confirm data looks right before proceeding. All stages are idempotent — safe to re-run if interrupted.

## Prerequisites

```bash
# Database URL in environment
export DATABASE_URL='postgres://everytab:PASS@RDS_ENDPOINT:5432/everytab'

# Schema created
psql $DATABASE_URL -f pipeline/01_cc_index/schema.sql

# Go binaries built and on the EC2 instance
# (build locally with CGO_ENABLED=0 GOOS=linux GOARCH=amd64, then scp)
```

## Stage 1: CC-Index Query

Populates the `hosts` table from Common Crawl's columnar index.

```bash
./pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 100000
# Full run: --limit 0
```

**Check:** `psql $DATABASE_URL -c "SELECT COUNT(*), protocol FROM hosts GROUP BY protocol;"`

## Stage 2: WARC Parsing

Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers.

```bash
./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log
```

**Check:** `psql $DATABASE_URL -c "SELECT COUNT(*) FROM hosts WHERE parsed = TRUE; SELECT COUNT(*) FROM icons;"`

## Stage 3: Icon Download

Downloads favicons from the live web, validates, uploads to S3.

```bash
./icon_download --db "$DATABASE_URL" --log-file icon_download.log --log-errors-only
```

**Check:** `psql $DATABASE_URL -c "SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;"`

## Stage 4: Best Icon Selection

Picks the best icon per host for display.

```bash
psql $DATABASE_URL -f pipeline/04_best_icon/select.sql
```

## Stage 5: Bundle Generation

Converts icons to PNG, assembles JSON bundles, uploads to S3.

```bash
./bundle_gen --db "$DATABASE_URL" --log-file bundle_gen.log --log-errors-only
```

**Check:** `aws s3 ls s3://everytab-site/tabs/ | wc -l`

Note the `TOTAL_BUNDLES` number from the summary — this gets baked into the frontend.

## Stage 6: Frontend Deploy

```bash
# Update TOTAL_BUNDLES in index.html
sed -i "s/const TOTAL_BUNDLES = .*/const TOTAL_BUNDLES = NUM;/" frontend/index.html

# Upload frontend files
aws s3 cp frontend/index.html s3://everytab-site/
aws s3 cp frontend/site.js s3://everytab-site/
aws s3 cp frontend/bot.html s3://everytab-site/

# Invalidate CloudFront cache (if configured)
# aws cloudfront create-invalidation --distribution-id DIST_ID --paths "/*"
```

## Full Reset (start over)

```bash
psql $DATABASE_URL -c "TRUNCATE hosts CASCADE;"
# Then run from Stage 1
```

## Future: Automated Monthly Run

For automating with new Common Crawl releases, wrap stages 1-6 in a script that:
1. Truncates old data
2. Runs each stage, checking exit codes
3. Compares stats against previous run (alert if coverage drops significantly)
4. Deploys only if all stages succeed
5. Sends notification on completion or failure

The pipeline is already designed for this — each stage writes stats JSON, uses CLI flags for configuration, and is resumable on failure.