# Pipeline

Run these stages in order on the EC2 instance. Each stage is a single command. Between stages, run the sanity checks to confirm the data looks right before proceeding. All stages are idempotent, so they are safe to re-run if interrupted.

## Prerequisites

```bash
# Postgres on i3 instance (run infra/db-setup.sh on the i3 first)
export DATABASE_URL='postgres://everytab@:5432/everytab'

# Go binaries built on EC2
cd ~/everytab
go build -o ~/warc_parse ./pipeline/02_warc_parse/
go build -o ~/icon_download ./pipeline/03_icon_download/
go build -o ~/bundle_gen ./pipeline/05_bundle_gen/
```

## Stage 1: CC-Index Query

Populates the `hosts` table from Common Crawl's columnar index.

```bash
./pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 100000
# Full run: --limit 0
```

## Stage 2: WARC Parsing

Fetches WARC records from CC's S3 and extracts titles, icons, and iframe headers.

```bash
./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log --log-errors-only
```

## Stage 3: Icon Download

Downloads favicons from the live web, validates them, and writes them to local disk.

```bash
GOMEMLIMIT=12GiB ./icon_download --db "$DATABASE_URL" --log-file icon_download.log --icons-dir ~/icons --log-errors-only
```

## Stage 4: Best Icon Selection

Picks the best icon per host for display.

```bash
psql "$DATABASE_URL" -f pipeline/04_best_icon/select.sql
```

## Stage 5: Bundle Generation

Converts icons to PNG, assembles JSON bundles, and uploads them to S3.

```bash
./bundle_gen --db "$DATABASE_URL" --log-file bundle_gen.log --log-errors-only
```

Note the `TOTAL_BUNDLES` number from the summary; it gets baked into the frontend.

## Stage 6: Frontend Deploy

From your local machine:

```bash
./pipeline/06_frontend/deploy.sh --total-bundles
```

## Stage 7: Backup to Homelab

After the site is deployed and verified, back up the data before tearing down the scanning infra.

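Before starting the transfer, it can help to sanity-check the expected egress cost and duration. The helper below is a minimal sketch, not part of the repo: the name `estimate_transfer` is hypothetical, and its defaults are the $0.09/GB egress rate and ~300Mbps home link speed quoted in this section. It assumes the link is the bottleneck and ignores `-z` savings.

```bash
# estimate_transfer SIZE_GB [USD_PER_GB] [LINK_MBPS]
# Hypothetical helper: prints rough egress cost and transfer time for one rsync run.
estimate_transfer() {
  local size_gb="$1" rate="${2:-0.09}" mbps="${3:-300}"
  awk -v gb="$size_gb" -v rate="$rate" -v mbps="$mbps" 'BEGIN {
    cost  = gb * rate                  # USD billed for outbound data
    hours = gb * 8000 / mbps / 3600    # GB to megabits, then seconds to hours
    printf "%dGB: ~$%.2f egress, ~%.1f hours at %dMbps\n", gb, cost, hours, mbps
  }'
}

estimate_transfer 500   # 500GB: ~$45.00 egress, ~3.7 hours at 300Mbps
estimate_transfer 50    # 50GB: ~$4.50 egress, ~0.4 hours at 300Mbps
```

Pass a different rate or link speed as the second and third arguments if your region or connection differs.
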
**What to back up:**

| Data | Location on EC2 | Size estimate (30M) | Purpose |
|------|-----------------|---------------------|---------|
| Database | RDS (pg_dump) | ~5-10GB compressed | Full hosts + icons metadata, titles, WARC coordinates |
| Icons | `~/icons/` directory | ~500GB-1TB | Complete favicon archive, content-addressed by SHA-256 |
| Stats | `~/stats/*.json` | <1MB | Pipeline timing and counts per stage |
| Logs | `~/*.log` | varies | Error logs for debugging |

**Backup commands:**

Use `-z` for compression: it reduces bytes on the wire, which reduces AWS egress costs ($0.09/GB). Icons are already in compressed formats (PNG/ICO), so savings are only ~5-10%, but on 500GB that is still $2-4.

```bash
# 1. Database dump (run on EC2; fast because it dumps to local disk first)
pg_dump -Fc "$DATABASE_URL" > ~/everytab_dump.pgfc

# 2. Transfer database to homelab
rsync -avPz ~/everytab_dump.pgfc homelab:/backups/everytab/

# 3. Transfer icons to homelab (this is the big one: 500GB+, will take hours)
# --ignore-existing skips icons already on homelab (for incremental monthly backups)
rsync -avPz --ignore-existing ~/icons/ homelab:/backups/everytab/icons/

# 4. Transfer stats and logs
rsync -avPz ~/stats/ homelab:/backups/everytab/stats/
rsync -avPz ~/*.log homelab:/backups/everytab/logs/
```

**Verify on homelab:**

```bash
# Check that the database restores (the everytab_local database must already exist)
pg_restore -d everytab_local /backups/everytab/everytab_dump.pgfc
psql everytab_local -c "SELECT COUNT(*) FROM hosts; SELECT COUNT(*) FROM icons;"

# Check that the icon count matches
find /backups/everytab/icons/ -type f | wc -l
```

**Then tear down:**

```bash
# From local machine
cd infra && terraform apply -var="scanning=false"
```

**Note:** The icons directory is the largest transfer. At home internet speeds (~300Mbps), transferring 500GB takes ~3-4 hours. Consider running rsync in tmux. If the transfer fails partway, rsync resumes where it left off.

**Data transfer costs:** AWS charges for outbound data. 500GB outbound = ~$45.
For the 3M dev run (~50GB) it's ~$5. To avoid this, you could attach an EBS volume, snapshot it, and restore it in your homelab's AWS account, but rsync is simpler.

## Full Reset (start over)

```bash
psql "$DATABASE_URL" -c "TRUNCATE hosts CASCADE;"
# Then run from Stage 1
```

## Future: Automated Monthly Run

For automating with new Common Crawl releases, wrap stages 1-6 in a script that:

1. Truncates old data
2. Runs each stage, checking exit codes
3. Compares stats against previous run (alert if coverage drops significantly)
4. Deploys only if all stages succeed
5. Sends notification on completion or failure

The pipeline is already designed for this: each stage writes stats JSON, uses CLI flags for configuration, and is resumable on failure.
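
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's script: `run_stage`, `coverage_dropped`, `notify`, and `monthly_run` are hypothetical names, the 10% threshold is arbitrary, "coverage" is approximated by the row count of `hosts` rather than the stats JSON (whose format is not described here), and absolute paths assume the repo lives in `~/everytab` with binaries in `~` as in Prerequisites. Stage 6 is left manual because it runs from the local machine and needs `TOTAL_BUNDLES` from the Stage 5 summary.

```bash
# run_stage NAME CMD...: run one stage, log the result, pass through its exit code.
run_stage() {
  local name="$1"
  shift
  echo "stage ${name}: starting"
  if "$@"; then
    echo "stage ${name}: ok"
  else
    local rc=$?
    echo "stage ${name}: FAILED (exit ${rc})" >&2
    return "$rc"
  fi
}

# coverage_dropped PREV CURR [MAX_DROP_PCT]: exit 0 when CURR is more than
# MAX_DROP_PCT percent (default 10, an arbitrary choice) below PREV.
coverage_dropped() {
  local prev="$1" curr="$2" max_pct="${3:-10}"
  [ "$prev" -gt 0 ] && [ $(( (prev - curr) * 100 )) -gt $(( prev * max_pct )) ]
}

# notify MESSAGE: placeholder; swap in email, a webhook, etc.
notify() { echo "NOTIFY: $*"; }

# monthly_run PREV_HOSTS: truncate, run stages 1-5, then gate the deploy.
monthly_run() {
  local prev_hosts="$1" repo="$HOME/everytab" curr_hosts
  run_stage reset psql "$DATABASE_URL" -c "TRUNCATE hosts CASCADE;" &&
    run_stage cc_index "$repo/pipeline/01_cc_index/query.sh" --db-url "$DATABASE_URL" --limit 0 &&
    run_stage warc_parse "$HOME/warc_parse" --db "$DATABASE_URL" \
      --log-file "$HOME/warc_parse.log" --log-errors-only &&
    run_stage icon_download env GOMEMLIMIT=12GiB "$HOME/icon_download" --db "$DATABASE_URL" \
      --log-file "$HOME/icon_download.log" --icons-dir "$HOME/icons" --log-errors-only &&
    run_stage best_icon psql "$DATABASE_URL" -f "$repo/pipeline/04_best_icon/select.sql" &&
    run_stage bundle_gen "$HOME/bundle_gen" --db "$DATABASE_URL" \
      --log-file "$HOME/bundle_gen.log" --log-errors-only ||
    { notify "pipeline failed; not deploying"; return 1; }

  # Assumption: host count stands in for "coverage".
  curr_hosts=$(psql "$DATABASE_URL" -Atc "SELECT COUNT(*) FROM hosts;") || return 1
  if coverage_dropped "$prev_hosts" "$curr_hosts"; then
    notify "hosts dropped from ${prev_hosts} to ${curr_hosts}; not deploying"
    return 1
  fi
  notify "stages 1-5 ok (${curr_hosts} hosts); ready for Stage 6 deploy"
}
```

Calling `monthly_run "$PREV_HOSTS"` with last month's host count runs the whole chain and stops at the first failing stage, since each `run_stage` call is joined with `&&`.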