updated pipeline README
Parent: 1b665d1065
Commit: a28cd2b056
1 changed file with 59 additions and 21 deletions
|
|
@ -2,7 +2,7 @@
|
||||||
|
|
||||||
Run these stages in order on the EC2 instance. Each stage is a single command.
|
Run these stages in order on the EC2 instance. Each stage is a single command.
|
||||||
|
|
||||||
Between stages, check the validation queries to confirm data looks right before proceeding. All stages are idempotent — safe to re-run if interrupted.
|
Between stages, run the sanity checks to confirm data looks right before proceeding. All stages are idempotent — safe to re-run if interrupted.
|
||||||
|
|
||||||
## Prerequisites
|
## Prerequisites
|
||||||
|
|
||||||
|
|
@ -13,8 +13,11 @@ export DATABASE_URL='postgres://everytab:PASS@RDS_ENDPOINT:5432/everytab'
|
||||||
# Schema created
|
# Schema created
|
||||||
psql $DATABASE_URL -f pipeline/01_cc_index/schema.sql
|
psql $DATABASE_URL -f pipeline/01_cc_index/schema.sql
|
||||||
|
|
||||||
# Go binaries built and on the EC2 instance
|
# Go binaries built on EC2
|
||||||
# (build locally with CGO_ENABLED=0 GOOS=linux GOARCH=amd64, then scp)
|
cd ~/everytab
|
||||||
|
go build -o ~/warc_parse ./pipeline/02_warc_parse/
|
||||||
|
go build -o ~/icon_download ./pipeline/03_icon_download/
|
||||||
|
go build -o ~/bundle_gen ./pipeline/05_bundle_gen/
|
||||||
```
|
```
|
||||||
|
|
||||||
## Stage 1: CC-Index Query
|
## Stage 1: CC-Index Query
|
||||||
|
|
@ -26,8 +29,6 @@ Populates the `hosts` table from Common Crawl's columnar index.
|
||||||
# Full run: --limit 0
|
# Full run: --limit 0
|
||||||
```
|
```
|
||||||
|
|
||||||
**Check:** `psql $DATABASE_URL -c "SELECT COUNT(*), protocol FROM hosts GROUP BY protocol;"`
|
|
||||||
|
|
||||||
## Stage 2: WARC Parsing
|
## Stage 2: WARC Parsing
|
||||||
|
|
||||||
Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers.
|
Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers.
|
||||||
|
|
@ -36,8 +37,6 @@ Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers.
|
||||||
./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log
|
./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log
|
||||||
```
|
```
|
||||||
|
|
||||||
**Check:** `psql $DATABASE_URL -c "SELECT COUNT(*) FROM hosts WHERE parsed = TRUE; SELECT COUNT(*) FROM icons;"`
|
|
||||||
|
|
||||||
## Stage 3: Icon Download
|
## Stage 3: Icon Download
|
||||||
|
|
||||||
Downloads favicons from the live web, validates, uploads to S3.
|
Downloads favicons from the live web, validates, uploads to S3.
|
||||||
|
|
@ -46,8 +45,6 @@ Downloads favicons from the live web, validates, uploads to S3.
|
||||||
./icon_download --db "$DATABASE_URL" --log-file icon_download.log --log-errors-only
|
./icon_download --db "$DATABASE_URL" --log-file icon_download.log --log-errors-only
|
||||||
```
|
```
|
||||||
|
|
||||||
**Check:** `psql $DATABASE_URL -c "SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;"`
|
|
||||||
|
|
||||||
## Stage 4: Best Icon Selection
|
## Stage 4: Best Icon Selection
|
||||||
|
|
||||||
Picks the best icon per host for display.
|
Picks the best icon per host for display.
|
||||||
|
|
@ -64,25 +61,66 @@ Converts icons to PNG, assembles JSON bundles, uploads to S3.
|
||||||
./bundle_gen --db "$DATABASE_URL" --log-file bundle_gen.log --log-errors-only
|
./bundle_gen --db "$DATABASE_URL" --log-file bundle_gen.log --log-errors-only
|
||||||
```
|
```
|
||||||
|
|
||||||
**Check:** `aws s3 ls s3://everytab-site/tabs/ | wc -l`
|
|
||||||
|
|
||||||
Note the `TOTAL_BUNDLES` number from the summary — this gets baked into the frontend.
|
Note the `TOTAL_BUNDLES` number from the summary — this gets baked into the frontend.
|
||||||
|
|
||||||
## Stage 6: Frontend Deploy
|
## Stage 6: Frontend Deploy
|
||||||
|
|
||||||
|
From your local machine:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Update TOTAL_BUNDLES in index.html
|
./pipeline/06_frontend/deploy.sh --total-bundles <NUMBER>
|
||||||
sed -i "s/const TOTAL_BUNDLES = .*/const TOTAL_BUNDLES = NUM;/" frontend/index.html
|
|
||||||
|
|
||||||
# Upload frontend files
|
|
||||||
aws s3 cp frontend/index.html s3://everytab-site/
|
|
||||||
aws s3 cp frontend/site.js s3://everytab-site/
|
|
||||||
aws s3 cp frontend/bot.html s3://everytab-site/
|
|
||||||
|
|
||||||
# Invalidate CloudFront cache (if configured)
|
|
||||||
# aws cloudfront create-invalidation --distribution-id DIST_ID --paths "/*"
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Stage 7: Backup to Homelab
|
||||||
|
|
||||||
|
After the site is deployed and verified, back up the data before tearing down the scanning infrastructure.
|
||||||
|
|
||||||
|
**What to back up:**
|
||||||
|
|
||||||
|
| Data | Location on EC2 | Size estimate (30M) | Purpose |
|
||||||
|
|------|----------------|---------------------|---------|
|
||||||
|
| Database | RDS (pg_dump) | ~5-10GB compressed | Full hosts + icons metadata, titles, WARC coordinates |
|
||||||
|
| Icons | `~/icons/` directory | ~500GB-1TB | Complete favicon archive, content-addressed by SHA-256 |
|
||||||
|
| Stats | `~/stats/*.json` | <1MB | Pipeline timing and counts per stage |
|
||||||
|
| Logs | `~/*.log` | varies | Error logs for debugging |
|
||||||
|
|
||||||
|
**Backup commands:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# 1. Database dump (run on EC2, fast — dumps to local disk first)
|
||||||
|
pg_dump -Fc "$DATABASE_URL" > ~/everytab_dump.pgfc
|
||||||
|
|
||||||
|
# 2. Transfer database to homelab
|
||||||
|
rsync -avP ~/everytab_dump.pgfc homelab:/backups/everytab/
|
||||||
|
|
||||||
|
# 3. Transfer icons to homelab (this is the big one — 500GB+, will take hours)
|
||||||
|
rsync -avP ~/icons/ homelab:/backups/everytab/icons/
|
||||||
|
|
||||||
|
# 4. Transfer stats and logs
|
||||||
|
rsync -avP ~/stats/ homelab:/backups/everytab/stats/
|
||||||
|
rsync -avP ~/*.log homelab:/backups/everytab/logs/
|
||||||
|
```
|
||||||
|
|
||||||
|
**Verify on homelab:**
|
||||||
|
```bash
|
||||||
|
# Check that the database restores (pg_restore -d needs the everytab_local database to exist already)
|
||||||
|
pg_restore -d everytab_local /backups/everytab/everytab_dump.pgfc
|
||||||
|
psql everytab_local -c "SELECT COUNT(*) FROM hosts; SELECT COUNT(*) FROM icons;"
|
||||||
|
|
||||||
|
# Check icon count matches
|
||||||
|
find /backups/everytab/icons/ -type f | wc -l
|
||||||
|
```
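The icon count check above can be turned into a pass/fail comparison. This is a minimal sketch, not part of the pipeline: `count_files` and `check_icon_count` are hypothetical helper names, and the expected count is assumed to come from running the same `find ... | wc -l` against `~/icons/` on the EC2 instance before teardown.

```bash
# Hypothetical helper: count regular files under a directory.
# Assumes file names contain no newlines (true for SHA-256 content-addressed names).
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Hypothetical helper: compare an expected file count against a backup directory.
# Prints the result and returns non-zero on a mismatch.
check_icon_count() {
  expected="$1"
  dir="$2"
  actual="$(count_files "$dir")"
  if [ "$actual" -eq "$expected" ]; then
    echo "match ($actual files)"
  else
    echo "mismatch: expected $expected, found $actual"
    return 1
  fi
}
```

For example, if the EC2 side reported 1234567 files, `check_icon_count 1234567 /backups/everytab/icons/` on the homelab would confirm the backup is complete before running the teardown.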
|
||||||
|
|
||||||
|
**Then tear down:**
|
||||||
|
```bash
|
||||||
|
# From local machine
|
||||||
|
cd infra && terraform apply -var="scanning=false"
|
||||||
|
```
|
||||||
|
|
||||||
|
**Note:** The icons directory is the largest transfer. At home internet speeds (~300 Mbps), transferring 500 GB takes roughly 3-4 hours. Consider running rsync inside tmux so the transfer survives an SSH disconnect. If the transfer fails partway, re-running the same rsync command picks up where it left off.
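The time estimate in the note can be reproduced with integer arithmetic. This is a rough sketch under stated assumptions: `estimate_transfer` is a hypothetical helper, sizes use decimal units (1 GB = 8000 megabits), and the link is assumed to run at its full rate with no protocol overhead, so real transfers will take somewhat longer.

```bash
# Hypothetical helper: estimate transfer time for SIZE_GB over a LINK_MBPS link.
# Assumes decimal units and a fully saturated link (no rsync/SSH overhead).
estimate_transfer() {
  size_gb="$1"
  link_mbps="$2"
  seconds=$(( size_gb * 8 * 1000 / link_mbps ))
  printf '%dh %dm\n' $(( seconds / 3600 )) $(( seconds % 3600 / 60 ))
}

estimate_transfer 500 300    # prints 3h 42m
estimate_transfer 1000 300   # prints 7h 24m
```

The second call covers the upper end of the ~500GB-1TB icon estimate, which is worth budgeting for when deciding how long to keep the EC2 instance running.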
|
||||||
|
|
||||||
|
**Data transfer costs:** AWS charges for outbound data. 500 GB outbound costs ~$45; the 3M dev run (~50 GB) costs ~$5. To avoid this, you could attach an EBS volume, snapshot it, and restore it in your homelab's AWS account, but rsync is simpler.
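The cost figures above work out to about $0.09 per GB. The sketch below makes that arithmetic explicit; `egress_cost` is a hypothetical helper, and the default rate of 9 cents per GB is an assumption inferred from the numbers in this README, not a quoted AWS price. Check current pricing for your region before relying on it.

```bash
# Hypothetical helper: estimate outbound transfer cost for SIZE_GB.
# The default rate (9 cents/GB) is inferred from "500GB = ~$45" and may not match current AWS pricing.
egress_cost() {
  size_gb="$1"
  cents_per_gb="${2:-9}"
  cents=$(( size_gb * cents_per_gb ))
  printf '$%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}

egress_cost 500   # prints $45.00
egress_cost 50    # prints $4.50
```

Passing a second argument overrides the rate, for example `egress_cost 1000 9` for the 1 TB upper bound of the icon archive.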
|
||||||
|
|
||||||
## Full Reset (start over)
|
## Full Reset (start over)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
|
|
||||||