Added query.sh to read the cc-index from S3 Parquet files and dump it into our psql db
parent 65d2757527
commit db81015e0b
3 changed files with 179 additions and 23 deletions
@@ -171,7 +171,7 @@ WHERE url_path = '/'
 AND fetch_status = 200
 AND url_query IS NULL
 AND url_protocol IN ('http', 'https')
-AND url_port IN (80, 443)
+AND url_port IS NULL
 ```
 
 **Deduplication:** Per hostname, prefer `https` over `http`. Result is one row per unique hostname.
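The deduplication rule in the hunk above can be illustrated with a minimal Python sketch. This is not the code from `query.sh`; the function name `dedupe_by_hostname` and the `url_host_name` key are assumptions made for illustration, and rows are assumed to have already passed the `WHERE` filter.

```python
def dedupe_by_hostname(rows):
    """Keep one row per hostname, preferring https over http.

    `rows` is an iterable of dicts with (assumed) keys
    'url_host_name' and 'url_protocol'.
    """
    best = {}
    for row in rows:
        host = row["url_host_name"]
        current = best.get(host)
        # Take the first row seen for a host, then upgrade to https if one appears.
        if current is None or (
            row["url_protocol"] == "https" and current["url_protocol"] != "https"
        ):
            best[host] = row
    return list(best.values())


rows = [
    {"url_host_name": "example.com", "url_protocol": "http"},
    {"url_host_name": "example.com", "url_protocol": "https"},
    {"url_host_name": "example.org", "url_protocol": "http"},
]
deduped = dedupe_by_hostname(rows)
```

Here `deduped` holds two rows: the `https` row for `example.com` and the only available row (`http`) for `example.org`.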