updated pipeline README to use compression and new flow
parent 6352b9253f
commit 1df9a234cf
1 changed file with 10 additions and 7 deletions
@@ -34,15 +34,15 @@ Populates the `hosts` table from Common Crawl's columnar index.
 Fetches WARC records from CC's S3, extracts titles, icons, and iframe headers.
 
 ```bash
-./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log
+./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log --log-errors-only
 ```
 
 ## Stage 3: Icon Download
 
-Downloads favicons from the live web, validates, uploads to S3.
+Downloads favicons from the live web, validates, downloads to disk.
 
 ```bash
-./icon_download --db "$DATABASE_URL" --log-file icon_download.log --log-errors-only
+./icon_download --db "$DATABASE_URL" --log-file icon_download.log --icons-dir icons/ --log-errors-only
 ```
 
 ## Stage 4: Best Icon Selection
@@ -86,19 +86,22 @@ After the site is deployed and verified, backup data before tearing down scannin
 
 **Backup commands:**
 
+Use `-z` for compression — reduces bytes on the wire which reduces AWS outbound egress costs ($0.09/GB). Icons are already compressed formats (PNG/ICO) so savings are ~5-10%, but on 500GB that's $2-4.
+
 ```bash
 # 1. Database dump (run on EC2, fast — dumps to local disk first)
 pg_dump -Fc $DATABASE_URL > ~/everytab_dump.pgfc
 
 # 2. Transfer database to homelab
-rsync -avP ~/everytab_dump.pgfc homelab:/backups/everytab/
+rsync -avPz ~/everytab_dump.pgfc homelab:/backups/everytab/
 
 # 3. Transfer icons to homelab (this is the big one — 500GB+, will take hours)
-rsync -avP ~/icons/ homelab:/backups/everytab/icons/
+# --ignore-existing skips icons already on homelab (for incremental monthly backups)
+rsync -avPz --ignore-existing ~/icons/ homelab:/backups/everytab/icons/
 
 # 4. Transfer stats and logs
-rsync -avP ~/stats/ homelab:/backups/everytab/stats/
-rsync -avP ~/*.log homelab:/backups/everytab/logs/
+rsync -avPz ~/stats/ homelab:/backups/everytab/stats/
+rsync -avPz ~/*.log homelab:/backups/everytab/logs/
 ```
 
 **Verify on homelab:**
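
The "$2-4" estimate in the added README line can be checked with quick arithmetic. This is a minimal sketch, assuming the $0.09/GB rate, the 500 GB size, and the 5-10% compression range quoted in the diff. `egress_savings` is a hypothetical helper written for this check and is not part of the pipeline.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimates egress cost (USD) for a share of a transfer.
#   $1 = transfer size in GB
#   $2 = egress price per GB in USD
#   $3 = percentage of the transfer to price (100 = whole transfer)
egress_savings() {
  local size_gb="$1" price_per_gb="$2" pct="$3"
  awk -v s="$size_gb" -v p="$price_per_gb" -v r="$pct" \
    'BEGIN { printf "%.2f\n", s * p * r / 100 }'
}

# Figures taken from the README text: 500 GB at $0.09/GB, 5-10% savings.
echo "full transfer cost: \$$(egress_savings 500 0.09 100)"
echo "saved at 5%:        \$$(egress_savings 500 0.09 5)"
echo "saved at 10%:       \$$(egress_savings 500 0.09 10)"
```

Under those assumptions the full transfer costs about $45 and compression saves roughly $2.25 to $4.50, which is consistent with the rounded range in the README.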