Added query.sh to read the cc-index from S3 Parquet files and dump it into our psql DB

Joe Lothan 2026-05-17 19:12:25 -04:00
parent 65d2757527
commit db81015e0b
3 changed files with 179 additions and 23 deletions

@@ -171,7 +171,7 @@ WHERE url_path = '/'
AND fetch_status = 200
AND url_query IS NULL
AND url_protocol IN ('http', 'https')
-AND url_port IN (80, 443)
+AND url_port IS NULL
```
**Deduplication:** Per hostname, prefer `https` over `http`. Result is one row per unique hostname.
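
The deduplication rule above could be sketched roughly as follows. This is not taken from `query.sh`; it is a minimal illustration that runs the SQL against an in-memory SQLite database so it can be tried locally. The table name `cc_rows`, the sample rows, and the helper `dedupe_hosts` are hypothetical; `url_host_name` and `url_protocol` are assumed to match the cc-index column names.

```python
import sqlite3

# Hypothetical sample rows: (url_host_name, url_protocol).
ROWS = [
    ("example.com", "http"),
    ("example.com", "https"),
    ("example.org", "http"),
    ("example.net", "https"),
    ("example.net", "https"),
]

# Rank the rows of each hostname so that https sorts first, then keep
# only the top-ranked row: one row per hostname, https preferred.
# Assumes a SQLite build with window-function support (3.25 or newer).
DEDUP_SQL = """
SELECT url_host_name, url_protocol
FROM (
    SELECT url_host_name,
           url_protocol,
           ROW_NUMBER() OVER (
               PARTITION BY url_host_name
               ORDER BY CASE url_protocol WHEN 'https' THEN 0 ELSE 1 END
           ) AS rn
    FROM cc_rows
) AS ranked
WHERE rn = 1
ORDER BY url_host_name
"""


def dedupe_hosts(rows):
    """Return one (hostname, protocol) pair per hostname, preferring https."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE cc_rows (url_host_name TEXT, url_protocol TEXT)"
        )
        conn.executemany("INSERT INTO cc_rows VALUES (?, ?)", rows)
        return conn.execute(DEDUP_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for host, protocol in dedupe_hosts(ROWS):
        print(host, protocol)
```

The same `ROW_NUMBER()` pattern should carry over to other engines that support window functions, though the exact query used in this repository may differ.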