From a28cd2b0565cb96ecf5d1ee587c16d0e15539d8d Mon Sep 17 00:00:00 2001
From: Joe Lothan
Date: Tue, 19 May 2026 13:06:48 -0400
Subject: [PATCH] updated pipeline README

---
 pipeline/README.md | 80 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 59 insertions(+), 21 deletions(-)

diff --git a/pipeline/README.md b/pipeline/README.md
index 4582d9e..b56f0f5 100644
--- a/pipeline/README.md
+++ b/pipeline/README.md
@@ -2,7 +2,7 @@
 
 Run these stages in order on the EC2 instance. Each stage is a single command.
 
-Between stages, check the validation queries to confirm data looks right before proceeding. All stages are idempotent — safe to re-run if interrupted.
+Between stages, run the sanity checks to confirm data looks right before proceeding. All stages are idempotent — safe to re-run if interrupted.
 
 ## Prerequisites
 
@@ -13,8 +13,11 @@ export DATABASE_URL='postgres://everytab:PASS@RDS_ENDPOINT:5432/everytab'
 # Schema created
 psql $DATABASE_URL -f pipeline/01_cc_index/schema.sql
 
-# Go binaries built and on the EC2 instance
-# (build locally with CGO_ENABLED=0 GOOS=linux GOARCH=amd64, then scp)
+# Go binaries built on EC2
+cd ~/everytab
+go build -o ~/warc_parse ./pipeline/02_warc_parse/
+go build -o ~/icon_download ./pipeline/03_icon_download/
+go build -o ~/bundle_gen ./pipeline/05_bundle_gen/
 ```
 
 ## Stage 1: CC-Index Query
@@ -26,8 +29,6 @@ Populates the `hosts` table from Common Crawl's columnar index.
 # Full run: --limit 0
 ```
 
-**Check:** `psql $DATABASE_URL -c "SELECT COUNT(*), protocol FROM hosts GROUP BY protocol;"`
-
 ## Stage 2: WARC Parsing
 
 Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers.
@@ -36,8 +37,6 @@ Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers.
 ./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log
 ```
 
-**Check:** `psql $DATABASE_URL -c "SELECT COUNT(*) FROM hosts WHERE parsed = TRUE; SELECT COUNT(*) FROM icons;"`
-
 ## Stage 3: Icon Download
 
 Downloads favicons from the live web, validates, uploads to S3.
@@ -46,8 +45,6 @@ Downloads favicons from the live web, validates, uploads to S3.
 ./icon_download --db "$DATABASE_URL" --log-file icon_download.log --log-errors-only
 ```
 
-**Check:** `psql $DATABASE_URL -c "SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;"`
-
 ## Stage 4: Best Icon Selection
 
 Picks the best icon per host for display.
@@ -64,25 +61,66 @@ Converts icons to PNG, assembles JSON bundles, uploads to S3.
 ./bundle_gen --db "$DATABASE_URL" --log-file bundle_gen.log --log-errors-only
 ```
 
-**Check:** `aws s3 ls s3://everytab-site/tabs/ | wc -l`
-
 Note the `TOTAL_BUNDLES` number from the summary — this gets baked into the frontend.
 
 ## Stage 6: Frontend Deploy
 
+From your local machine:
+
 ```bash
-# Update TOTAL_BUNDLES in index.html
-sed -i "s/const TOTAL_BUNDLES = .*/const TOTAL_BUNDLES = NUM;/" frontend/index.html
-
-# Upload frontend files
-aws s3 cp frontend/index.html s3://everytab-site/
-aws s3 cp frontend/site.js s3://everytab-site/
-aws s3 cp frontend/bot.html s3://everytab-site/
-
-# Invalidate CloudFront cache (if configured)
-# aws cloudfront create-invalidation --distribution-id DIST_ID --paths "/*"
+./pipeline/06_frontend/deploy.sh --total-bundles
 ```
 
+## Stage 7: Backup to Homelab
+
+After the site is deployed and verified, back up the data before tearing down the scanning infra.
+
+**What to back up:**
+
+| Data | Location on EC2 | Size estimate (30M) | Purpose |
+|------|----------------|---------------------|---------|
+| Database | RDS (pg_dump) | ~5-10GB compressed | Full hosts + icons metadata, titles, WARC coordinates |
+| Icons | `~/icons/` directory | ~500GB-1TB | Complete favicon archive, content-addressed by SHA-256 |
+| Stats | `~/stats/*.json` | <1MB | Pipeline timing and counts per stage |
+| Logs | `~/*.log` | varies | Error logs for debugging |
+
+**Backup commands:**
+
+```bash
+# 1. Database dump (run on EC2, fast — dumps to local disk first)
+pg_dump -Fc $DATABASE_URL > ~/everytab_dump.pgfc
+
+# 2. Transfer database to homelab
+rsync -avP ~/everytab_dump.pgfc homelab:/backups/everytab/
+
+# 3. Transfer icons to homelab (this is the big one — 500GB+, will take hours)
+rsync -avP ~/icons/ homelab:/backups/everytab/icons/
+
+# 4. Transfer stats and logs
+rsync -avP ~/stats/ homelab:/backups/everytab/stats/
+rsync -avP ~/*.log homelab:/backups/everytab/logs/
+```
+
+**Verify on homelab:**
+```bash
+# Check database restores
+pg_restore -d everytab_local /backups/everytab/everytab_dump.pgfc
+psql everytab_local -c "SELECT COUNT(*) FROM hosts; SELECT COUNT(*) FROM icons;"
+
+# Check icon count matches
+find /backups/everytab/icons/ -type f | wc -l
+```
+
+**Then tear down:**
+```bash
+# From local machine
+cd infra && terraform apply -var="scanning=false"
+```
+
+**Note:** The icons directory is the largest transfer. At home internet speeds (~300Mbps) transferring 500GB takes ~3-4 hours. Consider running rsync in tmux. If the transfer fails partway, rsync resumes where it left off.
+
+**Data transfer costs:** AWS charges for outbound data. 500GB outbound = ~$45. For the 3M dev run (~50GB) it's ~$5. To avoid this, you could attach an EBS volume, snapshot it, and restore it in your homelab's AWS account — but rsync is simpler.
+
 ## Full Reset (start over)
 
 ```bash