# Pipeline

Run these stages in order on the EC2 instance. Each stage is a single command.

Between stages, run the sanity checks to confirm the data looks right before proceeding. All stages are idempotent, so they are safe to re-run if interrupted.

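The sanity checks themselves are not spelled out in this document. As a minimal sketch of what one could look like, the helper below fails when a row count is below an expected floor. The name `check_min_count` and the threshold in the usage comment are assumptions for illustration, not part of the repo.

```bash
# Hypothetical helper (not part of the repo): fail when a count is below a floor.
check_min_count() {
  local label="$1" actual="$2" minimum="$3"
  # Reject empty or non-numeric input, e.g. when the query itself failed.
  case "$actual" in
    ''|*[!0-9]*)
      echo "FAIL $label: '$actual' is not a number" >&2
      return 2
      ;;
  esac
  if [ "$actual" -lt "$minimum" ]; then
    echo "FAIL $label: $actual rows, expected at least $minimum" >&2
    return 1
  fi
  echo "OK $label: $actual rows"
}

# Possible use after Stage 1 (the floor of 1000 is an arbitrary example):
#   check_min_count hosts "$(psql "$DATABASE_URL" -At -c 'SELECT COUNT(*) FROM hosts;')" 1000
```

Because the helper returns a non-zero status on failure, it can gate the next stage with `&&`.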
## Prerequisites

```bash
# Postgres on i3 instance (run infra/db-setup.sh on the i3 first)
export DATABASE_URL='postgres://everytab@<i3-private-ip>:5432/everytab'

# Go binaries built on EC2
cd ~/everytab
go build -o ~/warc_parse ./pipeline/02_warc_parse/
go build -o ~/icon_download ./pipeline/03_icon_download/
go build -o ~/bundle_gen ./pipeline/05_bundle_gen/
```

## Stage 1: CC-Index Query

Populates the `hosts` table from Common Crawl's columnar index.

```bash
./pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 100000
# Full run: --limit 0
```

## Stage 2: WARC Parsing

Fetches WARC records from Common Crawl's S3 bucket and extracts titles, icons, and iframe headers.

```bash
# The binary was built into ~ (see Prerequisites); the log goes to ~ so the Stage 7 backup of ~/*.log finds it
~/warc_parse --db "$DATABASE_URL" --log-file ~/warc_parse.log --log-errors-only
```

## Stage 3: Icon Download

Downloads favicons from the live web, validates them, and writes them to local disk.

```bash
GOMEMLIMIT=12GiB ~/icon_download --db "$DATABASE_URL" --log-file ~/icon_download.log --icons-dir ~/icons --log-errors-only
```

## Stage 4: Best Icon Selection

Picks the best icon per host for display.

```bash
psql "$DATABASE_URL" -f pipeline/04_best_icon/select.sql
```

## Stage 5: Bundle Generation

Converts icons to PNG, assembles JSON bundles, and uploads them to S3.

```bash
~/bundle_gen --db "$DATABASE_URL" --log-file ~/bundle_gen.log --log-errors-only
```

Note the `TOTAL_BUNDLES` number from the summary; it gets baked into the frontend.

## Stage 6: Frontend Deploy

From your local machine:

```bash
./pipeline/06_frontend/deploy.sh --total-bundles <NUMBER>
```

## Stage 7: Backup to Homelab

After the site is deployed and verified, back up the data before tearing down the scanning infra.

**What to back up:**

| Data | Location on EC2 | Size estimate (30M) | Purpose |
|------|----------------|---------------------|---------|
| Database | RDS (pg_dump) | ~5-10GB compressed | Full hosts + icons metadata, titles, WARC coordinates |
| Icons | `~/icons/` directory | ~500GB-1TB | Complete favicon archive, content-addressed by SHA-256 |
| Stats | `~/stats/*.json` | <1MB | Pipeline timing and counts per stage |
| Logs | `~/*.log` | varies | Error logs for debugging |

**Backup commands:**

Use `-z` for compression: fewer bytes on the wire means lower AWS egress costs ($0.09/GB). Icons are already in compressed formats (PNG/ICO), so savings are only ~5-10%, but on 500GB that is still $2-4.

```bash
# 1. Database dump (run on EC2; fast because it dumps to local disk first)
pg_dump -Fc "$DATABASE_URL" > ~/everytab_dump.pgfc

# 2. Transfer database to homelab
rsync -avPz ~/everytab_dump.pgfc homelab:/backups/everytab/

# 3. Transfer icons to homelab (this is the big one: 500GB+, will take hours)
# --ignore-existing skips icons already on homelab (for incremental monthly backups)
rsync -avPz --ignore-existing ~/icons/ homelab:/backups/everytab/icons/

# 4. Transfer stats and logs
rsync -avPz ~/stats/ homelab:/backups/everytab/stats/
rsync -avPz ~/*.log homelab:/backups/everytab/logs/
```

**Verify on homelab:**
```bash
# Check database restores (the target database must already exist, e.g. created with `createdb everytab_local`)
pg_restore -d everytab_local /backups/everytab/everytab_dump.pgfc
# One -c per query so both counts are printed regardless of psql version
psql everytab_local -c "SELECT COUNT(*) FROM hosts;" -c "SELECT COUNT(*) FROM icons;"

# Check icon count matches
find /backups/everytab/icons/ -type f | wc -l
```

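The icon count check above only prints a number. It still has to be compared against the count on the source side. A minimal sketch of that comparison follows. The helpers `count_files` and `compare_counts` are hypothetical, and the `ec2` SSH alias in the usage comment is an assumption.

```bash
# Hypothetical helpers (not part of the repo).

# Count regular files under a directory, recursively.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Compare a source-side count with a backup-side count.
compare_counts() {
  local source_count="$1" backup_count="$2"
  if [ "$source_count" -eq "$backup_count" ]; then
    echo "OK: $backup_count files on both sides"
  else
    echo "MISMATCH: source has $source_count, backup has $backup_count" >&2
    return 1
  fi
}

# Possible use on the homelab, assuming the EC2 instance is reachable as `ec2`:
#   compare_counts "$(ssh ec2 'find ~/icons -type f | wc -l')" \
#                  "$(count_files /backups/everytab/icons/)"
```

Running this before the teardown step leaves a chance to re-run rsync if the counts differ.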
**Then tear down:**
```bash
# From local machine
cd infra && terraform apply -var="scanning=false"
```

**Note:** The icons directory is the largest transfer. At home internet speeds (~300Mbps), transferring 500GB takes ~3-4 hours. Consider running rsync in tmux. If the transfer fails partway, re-running the same rsync command resumes where it left off.

**Data transfer costs:** AWS charges for outbound data. At $0.09/GB, 500GB outbound is ~$45; the 3M dev run (~50GB) is ~$5. To avoid this, you could attach an EBS volume, snapshot it, and restore it in your homelab's AWS account, but rsync is simpler.

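The time and cost figures above are simple arithmetic and can be recomputed for other dataset sizes. The sketch below assumes a fully saturated link and decimal units (1 GB = 8000 megabits). The function names are made up for illustration.

```bash
# Hypothetical helpers for back-of-the-envelope estimates (not part of the repo).

# Egress cost in USD for $1 GB at $2 USD/GB (defaults to the $0.09/GB rate quoted above).
egress_cost_usd() {
  awk -v gb="$1" -v rate="${2:-0.09}" 'BEGIN { printf "%.2f\n", gb * rate }'
}

# Transfer time in hours for $1 GB over a $2 Mbps link at full line rate.
transfer_hours() {
  awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.1f\n", gb * 8000 / mbps / 3600 }'
}

egress_cost_usd 500      # prints 45.00
egress_cost_usd 50       # prints 4.50
transfer_hours 500 300   # prints 3.7
```

Real transfers are slower than the line-rate figure because of per-file overhead on many small icons, so treat the result as a lower bound.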
## Full Reset (start over)

```bash
psql "$DATABASE_URL" -c "TRUNCATE hosts CASCADE;"
# Then run from Stage 1
```

## Future: Automated Monthly Run

For automating with new Common Crawl releases, wrap stages 1-6 in a script that:
1. Truncates old data
2. Runs each stage, checking exit codes
3. Compares stats against previous run (alert if coverage drops significantly)
4. Deploys only if all stages succeed
5. Sends notification on completion or failure

The pipeline is already designed for this: each stage writes stats JSON, uses CLI flags for configuration, and is resumable on failure.

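A rough sketch of such a wrapper is shown below. It is not an existing script. `notify`, `run_stage`, `coverage_ok`, and `run_pipeline` are hypothetical names, the notification is a placeholder `echo`, and the coverage check takes plain numbers because the stats JSON layout is not described here. The deploy step is left out because it runs from the local machine and needs the `TOTAL_BUNDLES` value from the Stage 5 summary.

```bash
# Sketch only: all helper names below are hypothetical, not part of the repo.

# Placeholder notification; swap in email, chat webhook, etc.
notify() {
  echo "[pipeline] $*"
}

# Run one stage and report the outcome; returns the stage's exit code on failure.
run_stage() {
  local name="$1"
  shift
  notify "starting: $name"
  if "$@"; then
    notify "finished: $name"
  else
    local status=$?
    notify "FAILED: $name (exit $status)"
    return "$status"
  fi
}

# Succeed unless `current` fell more than `max_drop_pct` percent below `previous`.
coverage_ok() {
  awk -v prev="$1" -v cur="$2" -v max="$3" \
    'BEGIN { exit !(prev <= 0 || (prev - cur) * 100 / prev <= max) }'
}

# Stages 1-5 chained with &&, so the first failure stops the run.
run_pipeline() {
  run_stage "reset" psql "$DATABASE_URL" -c "TRUNCATE hosts CASCADE;" &&
  run_stage "stage 1" ./pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 0 &&
  run_stage "stage 2" ~/warc_parse --db "$DATABASE_URL" --log-file ~/warc_parse.log --log-errors-only &&
  run_stage "stage 3" ~/icon_download --db "$DATABASE_URL" --log-file ~/icon_download.log --icons-dir ~/icons --log-errors-only &&
  run_stage "stage 4" psql "$DATABASE_URL" -f pipeline/04_best_icon/select.sql &&
  run_stage "stage 5" ~/bundle_gen --db "$DATABASE_URL" --log-file ~/bundle_gen.log --log-errors-only
}

# To run it: cd ~/everytab && run_pipeline
```

Chaining with `&&` means a later stage never starts after an earlier one fails, which is what makes "deploy only if all stages succeed" straightforward to add on top.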