added query.sh to read the cc-index from s3 parquet files and dump it into our psql db

Joe Lothan 2026-05-17 19:12:25 -04:00
parent 65d2757527
commit db81015e0b
3 changed files with 179 additions and 23 deletions

44
PLAN.md

@@ -102,32 +102,15 @@ CREATE INDEX idx_icons_host_id ON icons(host_id);
**Done when:** Tables exist in RDS, schema matches ARCHITECTURE.md.
### Step 1.2: DuckDB CC-Index Query (100K limit)
### Step 1.2: DuckDB CC-Index Query (100K limit) [COMPLETED]
Write `pipeline/01_cc_index/query.sql` (or a shell script wrapping DuckDB CLI).
Script: `pipeline/01_cc_index/query.sh`
The script:
1. Connects DuckDB to RDS via the postgres extension
2. Queries the CC-Index parquet files via httpfs (latest crawl)
3. Filters per ARCHITECTURE.md criteria
4. Deduplicates per hostname (prefer https)
5. Limits to 100,000 rows for dev
6. Inserts directly into the hosts table
Uses DuckDB with `aws` extension (credential chain) to read parquet directly from `s3://commoncrawl/.../*.parquet` glob, with the `postgres` extension to write results into RDS. Auto-detects latest crawl ID from the CC API.
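
The crawl auto-detection mentioned above can be sketched as follows. This is a minimal illustration, not the logic in `query.sh`: it assumes the CC API returns a JSON array of objects with an `id` field shaped like `CC-MAIN-<year>-<week>` (as Common Crawl's `collinfo.json` listing does), and the helper name `latest_crawl_id` and the sample payload are hypothetical.

```python
import json
import re

# Assumed ID shape for regular crawls: CC-MAIN-<year>-<week>.
CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def latest_crawl_id(collinfo_json: str) -> str:
    """Return the newest crawl ID from a collinfo-style JSON array."""
    newest = None
    for entry in json.loads(collinfo_json):
        match = CRAWL_ID.match(entry.get("id", ""))
        if match is None:
            continue  # skip IDs that do not follow the year-week pattern
        key = (int(match.group(1)), int(match.group(2)))
        if newest is None or key > newest[0]:
            newest = (key, entry["id"])
    if newest is None:
        raise ValueError("no CC-MAIN-YYYY-WW crawl IDs found")
    return newest[1]


# Illustrative payload only; a real response carries more fields per entry.
sample = json.dumps([
    {"id": "CC-MAIN-2026-05"},
    {"id": "CC-MAIN-2026-17"},
    {"id": "CC-MAIN-2025-51"},
])
print(latest_crawl_id(sample))
```

Sorting on the parsed (year, week) pair avoids relying on the order in which the API lists crawls.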
Key considerations:
- Find the latest crawl index path (e.g., `s3://commoncrawl/cc-index/collections/CC-MAIN-2026-05/indexes/cdx-00*.parquet` — verify actual path structure)
- DuckDB postgres extension: `INSTALL postgres; LOAD postgres; ATTACH 'dbname=... host=... ...' AS pg (TYPE POSTGRES);`
- The dedup logic: partition by hostname, order by protocol (https first), take first row
- Add `LIMIT 100000` for dev, remove for full run
- Time the query — if httpfs takes >1hr, switch to downloading parquet first
Deduplication via `GROUP BY url_host_name` with `first(... ORDER BY ...)` aggregates (hash aggregation — more memory-efficient than window functions).
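
As a rough illustration of the dedup rule (one row per hostname, https preferred), here is the same idea in plain Python. It is not the DuckDB query itself: the dict plays the role of the hash table, and the row shape (`protocol`, `warc`) and the function name `dedupe_hosts` are assumptions made for the example.

```python
def dedupe_hosts(rows):
    """Keep one row per hostname, preferring https over http."""
    best = {}
    for row in rows:
        host = row["url_host_name"]
        current = best.get(host)
        # Replace only when upgrading http -> https; otherwise first seen wins.
        if current is None or (
            current["protocol"] != "https" and row["protocol"] == "https"
        ):
            best[host] = row
    return list(best.values())


# Hypothetical sample rows.
rows = [
    {"url_host_name": "example.com", "protocol": "http", "warc": "a"},
    {"url_host_name": "example.com", "protocol": "https", "warc": "b"},
    {"url_host_name": "example.org", "protocol": "http", "warc": "c"},
]
deduped = dedupe_hosts(rows)
for row in deduped:
    print(row["url_host_name"], row["protocol"], row["warc"])
```

Memory grows with the number of distinct hostnames rather than the number of input rows, which is the property the hash-aggregation approach relies on.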
**Validation:**
- `SELECT COUNT(*) FROM hosts;` returns ~100,000
- `SELECT protocol, COUNT(*) FROM hosts GROUP BY protocol;` shows mostly https
- `SELECT * FROM hosts LIMIT 5;` shows sane data (real hostnames, valid WARC paths)
- Spot-check: pick a few hostnames, verify they're real websites
**Stats to emit:** `stats/01_cc_index.json` — includes: started_at, finished_at, duration_seconds, total_domains, https_count, http_count.
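
A possible shape for the stats payload, using the field names listed above. This is a sketch under assumptions: `build_stats` is a hypothetical helper, `total_domains` is taken to be the sum of the two protocol counts, and the timestamps and counts are placeholders (round numbers consistent with a 77%/23% split over 100K rows, not measured values).

```python
import json
from datetime import datetime, timedelta, timezone


def build_stats(started_at, finished_at, https_count, http_count):
    """Assemble the stats document with the fields named in the plan."""
    return {
        "started_at": started_at.isoformat(),
        "finished_at": finished_at.isoformat(),
        "duration_seconds": int((finished_at - started_at).total_seconds()),
        # Assumption: every host is either https or http.
        "total_domains": https_count + http_count,
        "https_count": https_count,
        "http_count": http_count,
    }


# Placeholder inputs for illustration only.
started = datetime(2026, 5, 17, 12, 0, 0, tzinfo=timezone.utc)
finished = started + timedelta(seconds=692)
stats = build_stats(started, finished, https_count=77_000, http_count=23_000)
print(json.dumps(stats, indent=2))
```

Writing the result to `stats/01_cc_index.json` is then a single `json.dump` call.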
**Result:** 100K hosts, 77% https / 23% http, completed in 692s.
**Done when:** 100K hosts in the database with valid WARC coordinates.
@@ -694,3 +677,20 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- Amazon Linux 2023 uses `systemd-resolved` which manages `/etc/resolv.conf`. Must disable it before pointing resolv.conf at Unbound. `chattr +i` doesn't work on the symlink.
- AWS EC2 key pairs created via API don't support passphrases. Use `tls_private_key` in Terraform or generate locally with `ssh-keygen` + import.
- When an AWS key pair name already exists from a previous run, Terraform may not regenerate it. Use `-replace` to force recreation of the key + instance together.
### Phase 1 (Steps 1.1-1.2) — Completed 2026-05-17
**Changes from original plan:**
- Used DuckDB `aws` extension with `CREDENTIAL_CHAIN` instead of httpfs anonymous access. The commoncrawl S3 bucket requires authenticated requests.
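- IAM role needed explicit `s3:GetObject` on `arn:aws:s3:::commoncrawl/*` and `s3:ListBucket` on `arn:aws:s3:::commoncrawl` (`ListBucket` is a bucket-level action, so it applies to the bucket ARN rather than the object wildcard). The bucket policy alone does not grant cross-account access.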
- Used `GROUP BY` with `first(... ORDER BY ...)` instead of `ROW_NUMBER()` window function. More memory-efficient (hash aggregation vs sort), cleaner syntax.
- DuckDB can glob `s3://.../subset=warc/*.parquet` directly (300 files) — no need to fetch a file list or download parquet locally.
- Dropped the `url_port IN (80, 443)` filter — CC stores standard ports as NULL, not 80/443. Replaced with `url_port IS NULL`.
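
The port-filter change in the last bullet comes down to SQL NULL semantics: `NULL IN (80, 443)` evaluates to NULL, not true, so rows with a NULL port never pass the old filter. The sketch below shows this with Python's built-in `sqlite3` rather than DuckDB (the comparison rules are standard SQL); the table name `cc_index` and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cc_index (url_host_name TEXT, url_port INTEGER)")
conn.executemany(
    "INSERT INTO cc_index VALUES (?, ?)",
    [
        ("example.com", None),   # standard port stored as NULL
        ("example.org", None),   # standard port stored as NULL
        ("example.net", 8080),   # non-standard port stored as an integer
    ],
)


def count_where(predicate):
    """Count rows matching a WHERE predicate (trusted, hard-coded input only)."""
    query = f"SELECT COUNT(*) FROM cc_index WHERE {predicate}"
    return conn.execute(query).fetchone()[0]


in_filter = count_where("url_port IN (80, 443)")
null_filter = count_where("url_port IS NULL")
print("IN (80, 443):", in_filter)
print("IS NULL:", null_filter)
```

Running a quick `SELECT DISTINCT url_port` style check against a sample of the real data before writing the filter would catch this kind of mismatch early.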
**Lessons learned:**
- DuckDB URL-encodes `=` in S3 paths (e.g., `crawl%3DCC-MAIN-2026-17`) but S3 decodes it correctly. The real issue was always IAM permissions, not path encoding.
- The `commoncrawl` S3 bucket requires valid AWS credentials for both `GetObject` and `ListBucket`; anonymous (unsigned) requests do not work. Any valid IAM identity works as long as its policy allows these actions.
- DuckDB's LIMIT can interact unexpectedly with GROUP BY — the optimizer may stop reading input early once it has enough groups. This wasn't our issue (it was the port filter) but worth noting for future queries.
- CC-Index stores `url_port` as NULL for standard ports (80/443), not as the integer. Always check actual column values before writing filters.
- c5.xlarge (8 GB) is tight for this query: it uses 6.4 GB plus swap. For the full 30M run, use c5.2xlarge (16 GB).
- Query takes ~692s (11.5 min) for 100K output rows reading all 300 parquet files. The full run without LIMIT should take about as long, since the scan already reads every file, but will need more memory for the hash table.
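
For the IAM lesson above, a minimal identity policy might look like the one generated below. This is a sketch, not the policy actually deployed: the `Sid` values and the helper name `cc_read_policy` are hypothetical. The one fixed point is standard S3 behavior: `s3:GetObject` is granted on object ARNs (`bucket/*`), while `s3:ListBucket` is granted on the bucket ARN itself.

```python
import json

BUCKET = "commoncrawl"


def cc_read_policy(bucket=BUCKET):
    """Build a read-only identity policy for one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadObjects",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",  # object-level ARN
            },
            {
                "Sid": "ListBucket",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",  # bucket-level ARN
            },
        ],
    }


policy = cc_read_policy()
print(json.dumps(policy, indent=2))
```

The `ListBucket` statement is what allows a `*.parquet` glob to expand; with `GetObject` alone, only reads of exact, already-known keys would succeed.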