added query.sh to read the cc-index from s3 parquet files and dump it into our psql db

2026-05-17 19:12:25 -04:00 · db81015e0b
commit db81015e0b
parent 65d2757527
3 changed files with 179 additions and 23 deletions
--- a/PLAN.md
+++ b/PLAN.md
@@ -102,32 +102,15 @@ CREATE INDEX idx_icons_host_id ON icons(host_id);

 **Done when:** Tables exist in RDS, schema matches ARCHITECTURE.md.

-### Step 1.2: DuckDB CC-Index Query (100K limit)
+### Step 1.2: DuckDB CC-Index Query (100K limit) [COMPLETED]

-Write `pipeline/01_cc_index/query.sql` (or a shell script wrapping DuckDB CLI).
+Script: `pipeline/01_cc_index/query.sh`

-The script:
-1. Connects DuckDB to RDS via the postgres extension
-2. Queries the CC-Index parquet files via httpfs (latest crawl)
-3. Filters per ARCHITECTURE.md criteria
-4. Deduplicates per hostname (prefer https)
-5. Limits to 100,000 rows for dev
-6. Inserts directly into the hosts table
+Uses DuckDB with the `aws` extension (credential chain) to read parquet directly from the `s3://commoncrawl/.../*.parquet` glob, and the `postgres` extension to write results into RDS. Auto-detects the latest crawl ID from the CC API.

-Key considerations:
-- Find the latest crawl index path (e.g., `s3://commoncrawl/cc-index/collections/CC-MAIN-2026-05/indexes/cdx-00*.parquet` — verify actual path structure)
-- DuckDB postgres extension: `INSTALL postgres; LOAD postgres; ATTACH 'dbname=... host=... ...' AS pg (TYPE POSTGRES);`
-- The dedup logic: partition by hostname, order by protocol (https first), take first row
-- Add `LIMIT 100000` for dev, remove for full run
-- Time the query — if httpfs takes >1hr, switch to downloading parquet first
+Deduplication via `GROUP BY url_host_name` with `first(... ORDER BY ...)` aggregates (hash aggregation — more memory-efficient than window functions).

-**Validation:**
-- `SELECT COUNT(*) FROM hosts;` returns ~100,000
-- `SELECT protocol, COUNT(*) FROM hosts GROUP BY protocol;` shows mostly https
-- `SELECT * FROM hosts LIMIT 5;` shows sane data (real hostnames, valid WARC paths)
-- Spot-check: pick a few hostnames, verify they're real websites
-
-**Stats to emit:** `stats/01_cc_index.json` — includes: started_at, finished_at, duration_seconds, total_domains, https_count, http_count.
+**Result:** 100K hosts, 77% https / 23% http, completed in 692s.

 **Done when:** 100K hosts in the database with valid WARC coordinates.

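Review note on Step 1.2: the `GROUP BY url_host_name` with `first(... ORDER BY ...)` dedup can be pictured as a single-pass hash aggregation that keeps one "best" row per hostname. The sketch below is only an illustration in plain Python, not the DuckDB query in `query.sh`; the `IndexRow` fields and the `dedupe_hosts` helper are hypothetical names chosen for the example.

```python
from typing import Dict, Iterable, NamedTuple


class IndexRow(NamedTuple):
    # Hypothetical subset of CC-Index columns, for illustration only.
    url_host_name: str
    url_protocol: str
    warc_filename: str


def dedupe_hosts(rows: Iterable[IndexRow]) -> Dict[str, IndexRow]:
    """Keep one row per hostname, preferring https, in a single pass."""
    best: Dict[str, IndexRow] = {}
    for row in rows:
        current = best.get(row.url_host_name)
        # Take the first row seen for a host; replace it only when an
        # https row arrives and the stored row is not already https.
        if current is None or (
            row.url_protocol == "https" and current.url_protocol != "https"
        ):
            best[row.url_host_name] = row
    return best


rows = [
    IndexRow("example.com", "http", "a.warc.gz"),
    IndexRow("example.com", "https", "b.warc.gz"),
    IndexRow("example.org", "http", "c.warc.gz"),
]
hosts = dedupe_hosts(rows)
print(hosts["example.com"].url_protocol)
```

Memory here grows with the number of distinct hostnames rather than the number of input rows, which is the property the plan attributes to hash aggregation over a sort-based window function.
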
@@ -694,3 +677,20 @@ On completion, each program prints a summary line and writes its stats JSON (wit
 - Amazon Linux 2023 uses `systemd-resolved` which manages `/etc/resolv.conf`. Must disable it before pointing resolv.conf at Unbound. `chattr +i` doesn't work on the symlink.
 - AWS EC2 key pairs created via API don't support passphrases. Use `tls_private_key` in Terraform or generate locally with `ssh-keygen` + import.
 - When an AWS key pair name already exists from a previous run, Terraform may not regenerate it. Use `-replace` to force recreation of the key + instance together.
+
+### Phase 1 (Steps 1.1-1.2) — Completed 2026-05-17
+
+**Changes from original plan:**
+- Used DuckDB `aws` extension with `CREDENTIAL_CHAIN` instead of httpfs anonymous access. The commoncrawl S3 bucket requires authenticated requests.
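+- IAM role needed explicit `s3:GetObject` on `arn:aws:s3:::commoncrawl/*` and `s3:ListBucket` on `arn:aws:s3:::commoncrawl` (ListBucket is a bucket-level action, so it takes the bucket ARN rather than the object ARN). The bucket doesn't allow cross-account access based on bucket policy alone.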
+- Used `GROUP BY` with `first(... ORDER BY ...)` instead of `ROW_NUMBER()` window function. More memory-efficient (hash aggregation vs sort), cleaner syntax.
+- DuckDB can glob `s3://.../subset=warc/*.parquet` directly (300 files) — no need to fetch a file list or download parquet locally.
+- Dropped the `url_port IN (80, 443)` filter — CC stores standard ports as NULL, not 80/443. Replaced with `url_port IS NULL`.
+
+**Lessons learned:**
+- DuckDB URL-encodes `=` in S3 paths (e.g., `crawl%3DCC-MAIN-2026-17`) but S3 decodes it correctly. The real issue was always IAM permissions, not path encoding.
+- The `commoncrawl` S3 bucket requires valid AWS credentials for both GetObject and ListBucket. Anonymous access (unsigned requests) does not work. Any valid IAM identity works as long as its policy allows these actions.
+- DuckDB's LIMIT can interact unexpectedly with GROUP BY — the optimizer may stop reading input early once it has enough groups. This wasn't our issue (it was the port filter) but worth noting for future queries.
+- CC-Index stores `url_port` as NULL for standard ports (80/443), not as the integer. Always check actual column values before writing filters.
+- c5.xlarge (8GB) is tight for this query — uses 6.4GB + swap. For the full 30M run, use c5.2xlarge (16GB).
+- The query takes ~692s (11.5 min) to produce 100K output rows while reading all 300 parquet files. The full run without LIMIT should take a similar time, since it reads the same files, but will need more memory for the hash table.
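
Review note on the IAM permissions above: a minimal sketch of a policy document granting the two actions, built as a Python dict and serialized to JSON. This is an assumption about what the role needs based on the notes, not the policy actually deployed; the `Sid` values are hypothetical. `s3:GetObject` is an object-level action and attaches to the object ARN (`.../*`), while `s3:ListBucket` is a bucket-level action and attaches to the bucket ARN itself.

```python
import json

BUCKET = "commoncrawl"

# Hypothetical policy sketch: Sid names are illustrative only.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadCommonCrawlObjects",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            # Object-level action: resource is the objects in the bucket.
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {
            "Sid": "ListCommonCrawlBucket",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            # Bucket-level action: resource is the bucket itself.
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The printed JSON could be attached to the instance role, for example through a Terraform policy resource; listing is what lets DuckDB expand the `*.parquet` glob, and GetObject is what lets it read each file.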