added initial pipeline README
parent 921f72d2aa
commit a977a8c0b3
1 changed file with 102 additions and 0 deletions
102
pipeline/README.md
Normal file
@@ -0,0 +1,102 @@
# Pipeline

Run these stages in order on the EC2 instance. Each stage is a single command.

Between stages, run the stage's validation query to confirm the data looks right before proceeding. All stages are idempotent, so they are safe to re-run if interrupted.
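The "check before proceeding" step can be turned into a hard gate. The following is a minimal sketch, not part of the pipeline: `assert_min` is a hypothetical helper, and the minimum of 1 in the usage comment is an arbitrary illustrative threshold.

```bash
# Hypothetical helper (not part of the pipeline): fail unless a count
# reaches a minimum, so a script can stop before the next stage.
assert_min() {
  local label="$1" actual="$2" min="$3"
  # Reject anything that is not a non-negative integer (e.g. empty psql output).
  case "$actual" in
    ''|*[!0-9]*)
      echo "FAIL $label: non-numeric count '$actual'" >&2
      return 1
      ;;
  esac
  if [ "$actual" -lt "$min" ]; then
    echo "FAIL $label: $actual < $min" >&2
    return 1
  fi
  echo "OK $label: $actual"
}

# Example use after Stage 1. -A (unaligned) and -t (tuples only) make psql
# print just the number:
#   count=$(psql "$DATABASE_URL" -At -c "SELECT COUNT(*) FROM hosts;")
#   assert_min hosts "$count" 1 || exit 1
```

Keeping the comparison separate from the `psql` call means the same helper could gate any of the per-stage checks.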

## Prerequisites

```bash
# Database URL in the environment (replace PASS and RDS_ENDPOINT)
export DATABASE_URL='postgres://everytab:PASS@RDS_ENDPOINT:5432/everytab'

# Create the schema
psql "$DATABASE_URL" -f pipeline/01_cc_index/schema.sql

# Go binaries built and copied to the EC2 instance
# (build locally with CGO_ENABLED=0 GOOS=linux GOARCH=amd64, then scp)
```
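Before starting a run, it may help to confirm these prerequisites in one step. This is a sketch only: `preflight` is a hypothetical function, and the binary paths in the usage comment are assumed from the stage commands in this README.

```bash
# Hypothetical preflight check (not part of the pipeline).
# Verifies DATABASE_URL is set and each given binary is executable.
preflight() {
  local missing=0 bin
  if [ -z "${DATABASE_URL:-}" ]; then
    echo "DATABASE_URL is not set" >&2
    missing=1
  fi
  for bin in "$@"; do
    if [ ! -x "$bin" ]; then
      echo "missing or not executable: $bin" >&2
      missing=1
    fi
  done
  # Returns 0 only if every check passed.
  return "$missing"
}

# Example:
#   preflight ./warc_parse ./icon_download ./bundle_gen || exit 1
```

The function reports every problem it finds rather than stopping at the first one, so a single run shows everything that still needs fixing.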

## Stage 1: CC-Index Query

Populates the `hosts` table from Common Crawl's columnar index.

```bash
./pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 100000
# Full run: --limit 0
```

**Check:** `psql "$DATABASE_URL" -c "SELECT COUNT(*), protocol FROM hosts GROUP BY protocol;"`

## Stage 2: WARC Parsing

Fetches WARC records from Common Crawl's S3 bucket and extracts titles, icons, and iframe headers.

```bash
./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log
```

**Check:** `psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM hosts WHERE parsed = TRUE;" -c "SELECT COUNT(*) FROM icons;"`

## Stage 3: Icon Download

Downloads favicons from the live web, validates them, and uploads them to S3.

```bash
./icon_download --db "$DATABASE_URL" --log-file icon_download.log --log-errors-only
```

**Check:** `psql "$DATABASE_URL" -c "SELECT scan_state, COUNT(*) FROM icons GROUP BY scan_state;"`

## Stage 4: Best Icon Selection

Picks the best icon per host for display.

```bash
psql "$DATABASE_URL" -f pipeline/04_best_icon/select.sql
```

## Stage 5: Bundle Generation

Converts icons to PNG, assembles JSON bundles, and uploads them to S3.

```bash
./bundle_gen --db "$DATABASE_URL" --log-file bundle_gen.log --log-errors-only
```

**Check:** `aws s3 ls s3://everytab-site/tabs/ | wc -l`

Note the `TOTAL_BUNDLES` number from the summary; Stage 6 bakes it into the frontend.

## Stage 6: Frontend Deploy

```bash
# Update TOTAL_BUNDLES in index.html
# (replace NUM with the TOTAL_BUNDLES number from Stage 5)
sed -i "s/const TOTAL_BUNDLES = .*/const TOTAL_BUNDLES = NUM;/" frontend/index.html

# Upload frontend files
aws s3 cp frontend/index.html s3://everytab-site/
aws s3 cp frontend/site.js s3://everytab-site/
aws s3 cp frontend/bot.html s3://everytab-site/

# Invalidate CloudFront cache (if configured)
# aws cloudfront create-invalidation --distribution-id DIST_ID --paths "/*"
```
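Because `NUM` is substituted by hand, a typo can ship a broken `index.html`. One way to guard against that is to validate the value before rewriting the file. The sketch below is an assumption-laden illustration: `set_total_bundles` is a hypothetical helper, and the `1234` in the usage comment is a placeholder, not a real bundle count.

```bash
# Hypothetical helper (not part of the pipeline): write a validated
# bundle count into the frontend file.
set_total_bundles() {
  local file="$1" num="$2" tmp
  # Accept only a non-negative integer, so a leftover "NUM" is rejected.
  case "$num" in
    ''|*[!0-9]*)
      echo "TOTAL_BUNDLES must be a non-negative integer, got '$num'" >&2
      return 1
      ;;
  esac
  # Refuse to continue if the declaration to replace is not present.
  if ! grep -q 'const TOTAL_BUNDLES = ' "$file"; then
    echo "no TOTAL_BUNDLES declaration found in $file" >&2
    return 1
  fi
  # Write to a temp file first, then copy back; this avoids relying on
  # `sed -i` and keeps the original file's permissions.
  tmp=$(mktemp) || return 1
  sed "s/const TOTAL_BUNDLES = .*/const TOTAL_BUNDLES = ${num};/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  local rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Example (1234 is a placeholder):
#   set_total_bundles frontend/index.html 1234
```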

## Full Reset (start over)

```bash
# CASCADE also truncates every table that references hosts by foreign key
psql "$DATABASE_URL" -c "TRUNCATE hosts CASCADE;"
# Then run from Stage 1
```

## Future: Automated Monthly Run

To automate runs against new Common Crawl releases, wrap stages 1-6 in a script that:

1. Truncates old data
2. Runs each stage, checking exit codes
3. Compares stats against the previous run (alert if coverage drops significantly)
4. Deploys only if all stages succeed
5. Sends a notification on completion or failure

The pipeline is already designed for this: each stage writes stats JSON, uses CLI flags for configuration, and is resumable on failure.
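Steps 2 to 4 of such a wrapper could look roughly like the sketch below. Everything here is hypothetical: `run_stage` and `coverage_ok` do not exist in the pipeline, the 10% threshold is an arbitrary example, and the sketch takes the two counts as plain integers rather than reading the stats JSON, whose format this README does not describe.

```bash
# Hypothetical wrapper pieces (not part of the pipeline).

# Run one stage, report a failure on stderr, and pass its exit code through.
run_stage() {
  local name="$1" rc=0
  shift
  echo "==> $name"
  "$@" || rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "FAILED $name (exit $rc)" >&2
  fi
  return "$rc"
}

# Succeed only if `current` is no more than `max_drop_pct` percent
# below `previous`. Uses integer arithmetic only:
#   current * 100 >= previous * (100 - max_drop_pct)
coverage_ok() {
  local previous="$1" current="$2" max_drop_pct="$3"
  [ $(( current * 100 )) -ge $(( previous * (100 - max_drop_pct) )) ]
}

# Example (commented out so nothing runs on paste; prev_parsed and
# cur_parsed are placeholders for counts from two runs):
#   run_stage "warc_parse" ./warc_parse --db "$DATABASE_URL" --log-file warc_parse.log || exit 1
#   coverage_ok "$prev_parsed" "$cur_parsed" 10 || { echo "coverage dropped" >&2; exit 1; }
```

Chaining each call with `|| exit 1` gives the "deploy only if all stages succeed" behavior without needing `set -e`.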