updated pipeline README to use compression and new flow

Joe Lothan 2026-05-20 11:54:48 -04:00
parent 6352b9253f
commit 1df9a234cf

@@ -34,15 +34,15 @@ Populates the `hosts` table from Common Crawl's columnar index.
Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers.
```bash
./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log
./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log --log-errors-only
```
## Stage 3: Icon Download
Downloads favicons from the live web, validates, uploads to S3.
Downloads favicons from the live web, validates them, and saves them to disk.
```bash
./icon_download --db "$DATABASE_URL" --log-file icon_download.log --log-errors-only
./icon_download --db "$DATABASE_URL" --log-file icon_download.log --icons-dir icons/ --log-errors-only
```
## Stage 4: Best Icon Selection
@@ -86,19 +86,22 @@ After the site is deployed and verified, backup data before tearing down scannin
**Backup commands:**
Use `-z` to compress data in transit: fewer bytes on the wire means lower AWS outbound egress costs ($0.09/GB). Icons are already in compressed formats (PNG/ICO), so savings are only ~5-10%, but on 500GB (about $45 of egress) that is still roughly $2-4.
```bash
# 1. Database dump (run on EC2, fast — dumps to local disk first)
pg_dump -Fc "$DATABASE_URL" > ~/everytab_dump.pgfc
# 2. Transfer database to homelab
rsync -avP ~/everytab_dump.pgfc homelab:/backups/everytab/
rsync -avPz ~/everytab_dump.pgfc homelab:/backups/everytab/
# 3. Transfer icons to homelab (this is the big one — 500GB+, will take hours)
rsync -avP ~/icons/ homelab:/backups/everytab/icons/
# --ignore-existing skips icons already on homelab (for incremental monthly backups)
rsync -avPz --ignore-existing ~/icons/ homelab:/backups/everytab/icons/
# 4. Transfer stats and logs
rsync -avP ~/stats/ homelab:/backups/everytab/stats/
rsync -avP ~/*.log homelab:/backups/everytab/logs/
rsync -avPz ~/stats/ homelab:/backups/everytab/stats/
rsync -avPz ~/*.log homelab:/backups/everytab/logs/
```
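The `-z` savings estimate above can be reproduced with a small calculation. This is a minimal sketch, not part of the pipeline: `egress_savings` is a hypothetical helper name, and the inputs (500GB, $0.09/GB, 5-10% compression savings) are the figures quoted in this README rather than measured values.

```bash
# Hypothetical helper (not part of the pipeline): estimate dollars saved by rsync -z.
# Arguments: transfer size in GB, egress price in USD per GB, compression savings in percent.
egress_savings() {
  LC_ALL=C awk -v gb="$1" -v price="$2" -v pct="$3" \
    'BEGIN { printf "%.2f\n", gb * price * pct / 100 }'
}

# Figures quoted in this README: 500GB at $0.09/GB, 5-10% savings on icons.
egress_savings 500 0.09 5    # low end of the range
egress_savings 500 0.09 10   # high end of the range
```

Re-running it with the actual transfer size reported by `rsync` would show whether compression is worth the extra CPU time for a given backup.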
**Verify on homelab:**