From 1df9a234cf60b848dafa85adef3649dc756c2606 Mon Sep 17 00:00:00 2001
From: Joe Lothan
Date: Wed, 20 May 2026 11:54:48 -0400
Subject: [PATCH] updated pipeline README to use compression and new flow

---
 pipeline/README.md | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/pipeline/README.md b/pipeline/README.md
index b56f0f5..354a0fd 100644
--- a/pipeline/README.md
+++ b/pipeline/README.md
@@ -34,15 +34,15 @@ Populates the `hosts` table from Common Crawl's columnar index.
 Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers.
 
 ```bash
-./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log
+./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log --log-errors-only
 ```
 
 ## Stage 3: Icon Download
 
-Downloads favicons from the live web, validates, uploads to S3.
+Downloads favicons from the live web, validates, downloads to disk.
 
 ```bash
-./icon_download --db "$DATABASE_URL" --log-file icon_download.log --log-errors-only
+./icon_download --db "$DATABASE_URL" --log-file icon_download.log --icons-dir icons/ --log-errors-only
 ```
 
 ## Stage 4: Best Icon Selection
@@ -86,19 +86,22 @@ After the site is deployed and verified, backup data before tearing down scannin
 
 **Backup commands:**
 
+Use `-z` for compression — reduces bytes on the wire which reduces AWS outbound egress costs ($0.09/GB). Icons are already compressed formats (PNG/ICO) so savings are ~5-10%, but on 500GB that's $2-4.
+
 ```bash
 # 1. Database dump (run on EC2, fast — dumps to local disk first)
 pg_dump -Fc $DATABASE_URL > ~/everytab_dump.pgfc
 
 # 2. Transfer database to homelab
-rsync -avP ~/everytab_dump.pgfc homelab:/backups/everytab/
+rsync -avPz ~/everytab_dump.pgfc homelab:/backups/everytab/
 
 # 3. Transfer icons to homelab (this is the big one — 500GB+, will take hours)
-rsync -avP ~/icons/ homelab:/backups/everytab/icons/
+# --ignore-existing skips icons already on homelab (for incremental monthly backups)
+rsync -avPz --ignore-existing ~/icons/ homelab:/backups/everytab/icons/
 
 # 4. Transfer stats and logs
-rsync -avP ~/stats/ homelab:/backups/everytab/stats/
-rsync -avP ~/*.log homelab:/backups/everytab/logs/
+rsync -avPz ~/stats/ homelab:/backups/everytab/stats/
+rsync -avPz ~/*.log homelab:/backups/everytab/logs/
 ```
 
 **Verify on homelab:**