don't use zdns, just use a local unbound to make things easier

Joe Lothan 2026-05-17 12:18:47 -04:00
parent f6ec08535f
commit 8ef465b2a4

@@ -11,7 +11,7 @@ there are conceptually two stages to the project: the scanning phase and the host
- download the cc-index to the ec2 instance
- use duckdb with the postgres extension to do a one shot query, creating a table called "hosts" that only includes the relevant rows and columns of the cc-index
- read and parse the WARCs (defensively, accounting for malformed html) for the html "title" information (to update the "hosts" table) and the rel="icon" information (to create a new table called "icons" with a foreign key pointing to the correct "hosts" row; after all, one page can have multiple icons)
-- from the "icons" database, parse the url and use zdns or another fast recursive resolver to resolve all dns names, so on scan time it doesn't overload upstream recursive resolvers, updating the "icons" table
+- run a recursive dns resolver, so dns does not go upstream but instead hits a local dns server so we don't annoy upstream resolvers. use caching and a min-ttl to ensure that when running the entire scan we reuse resolutions of tlds and even popular subdomains.
- download all the icons from the "icons" db, saving them into an s3 bucket named "everytab" in a folder named "icons" with a unique id (maybe a hash of the contents but maybe something else). update the "icons" database table with scan status and hash once it's found.
- last phase is to use this data to prep for the hosting phase
- hosting phase
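
The "caching and a min-ttl" idea in the resolver step above can be illustrated with a toy cache. This is only a sketch of the concept, not how the scan does it: in practice unbound keeps its own cache, and its `cache-min-ttl` option in unbound.conf covers the clamping. The class name `MinTTLCache`, its methods, and the injectable `clock` are hypothetical names made up for this example.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class MinTTLCache:
    """Toy DNS answer cache that never expires an entry sooner than min_ttl seconds."""

    def __init__(self, min_ttl: float, clock: Callable[[], float] = time.monotonic):
        self.min_ttl = min_ttl
        self.clock = clock  # injectable so the behaviour can be checked without sleeping
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expiry time)

    def put(self, name: str, ip: str, ttl: float) -> None:
        # Clamp short TTLs upward so popular names stay cached during the run.
        effective_ttl = max(ttl, self.min_ttl)
        self._entries[name.lower()] = (ip, self.clock() + effective_ttl)

    def get(self, name: str) -> Optional[str]:
        key = name.lower()
        entry = self._entries.get(key)
        if entry is None:
            return None
        ip, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: the caller has to resolve again
            return None
        return ip
```

With `min_ttl=300`, a record published with a 30 second TTL is still served from the cache two minutes later, which is the reuse the notes are after. The trade-off is that answers can be stale for up to `min_ttl` seconds.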
@@ -28,7 +28,7 @@ The scale and heterogeneity of the data makes this a challenge. We can piggyback
runs will take multiple hours, so we need to ensure correctness of the data pipeline, ensure the infrastructure can handle the bandwidth and size, and capture statistics throughout the run as a sanity check.
Statistics about errors or loss between each step of the workflow would be especially useful. printing some of them out for independent investigation will help squash bugs.
Testing each part with a "dry run" that processes, say, 10,000 rows and doesn't make any changes (but prints the changes it would make to the command line) seems like a good way to do testing (though if there are better ways, let me know)
In fact a good strategy for rollout will be to cap the number of domains we get from the cc-index to 100,000 at the start, to have a smaller dataset to build and test this workflow and site end to end before scaling up to the full 30 million domains.
Though also keep in mind that we are fine if we don't capture every single site. Capturing even 90% would be a win, and we can iterate to get it to 99% on the next crawl (naturally there will be some loss, as Common Crawl only happens once a month and sites go up and down between the crawl release and our captures). Also, we want the effect to *feel* like every site is on there; the randomness of its display will mean that it is hard to check whether literally every site is there (though, once again, we do want to support as many as is easy to start off with)
on that note we want to do iterative development
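
A minimal sketch of the "dry run" idea described above, assuming each pipeline step can be split into a function that plans a change for a row and a function that applies it. The names `run_step`, `plan`, and `apply` are hypothetical and only for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, str]


def run_step(
    rows: Iterable[Row],
    plan: Callable[[Row], Row],
    apply: Callable[[Row], None],
    dry_run: bool = True,
    limit: Optional[int] = 10_000,
) -> List[Row]:
    """Plan a change for each row; only call apply() when dry_run is False."""
    planned: List[Row] = []
    for i, row in enumerate(rows):
        if limit is not None and i >= limit:
            break  # cap the run so a test pass stays small
        change = plan(row)
        planned.append(change)
        if dry_run:
            print(f"[dry-run] would write: {change}")
        else:
            apply(change)
    return planned
```

The same `limit` argument could also serve the capped rollout (for example 100,000 domains) by running with `dry_run=False` and a larger cap.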
@@ -45,7 +45,7 @@ for more specification, notes, thoughts, etc
- output to the hosts table:
- postgres columns should be: hostname, ip, crawl, protocol (http or https), warc_filename, warc_record_offset, warc_record_length, warc_segments, favicon_hash, html_title
- use duckdb with a postgres extension with a one shot query to take the data I want and put it in the db
-- use zdns to resolve all domain names to ips, (or some better option for resolving domain names - search for one!) but we don't want to forward any request to an upstream resolver, we want to resolve it ourselves so we don't get throttled
+- use unbound to resolve all domain names to ips, (or some better option for resolving domain names - search for one!) but we don't want to forward any request to an upstream resolver, we want to resolve it ourselves so we don't get throttled
- we want to use custom high performance web golang code so the warc reading and parsing is fast and saturates all cores / bandwidth
- we want to also use a custom golang program to package up our found, valid (host, icon, title) jsons, also ideally multithreaded and high performance
- there is a big question about the heterogeneity of favicon types and how much and what to store. I'd like to allow for multiple favicons to be specified. We should include a /favicon.ico for all sites (with the same protocol: http/https as the host) whether they have link rel="icon" set or not
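
One way to read the favicon rule above (every declared icon plus a default /favicon.ico on the host's own protocol) is sketched below. The function name `favicon_candidates` is hypothetical, and resolving relative hrefs against the host root rather than the page url is a simplifying assumption of this sketch.

```python
from typing import Iterable, List
from urllib.parse import urljoin


def favicon_candidates(hostname: str, protocol: str, declared_hrefs: Iterable[str]) -> List[str]:
    """Return absolute icon urls for a host, always including /favicon.ico."""
    base = f"{protocol}://{hostname}/"  # assumption: hrefs are relative to the host root
    candidates = [urljoin(base, href.strip()) for href in declared_hrefs if href.strip()]
    # Always try the conventional location, using the same protocol as the host.
    candidates.append(urljoin(base, "/favicon.ico"))
    seen = set()
    unique: List[str] = []
    for url in candidates:
        if url not in seen:  # drop duplicates while keeping the declared order
            seen.add(url)
            unique.append(url)
    return unique
```

Each returned url would become one row in the "icons" table pointing back at its "hosts" row.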
@@ -55,3 +55,5 @@ for more specification, notes, thoughts, etc
- for each host, we want https first, falling back to http. ideally one line per host.
- we want the frontend to be simple and snappy with good UI/UX. we want it to be beautiful.
- potentially we can even read the user agent and have different styles to mimic the tab styles of the visitor's browser. that might be a later feature, for now keep it simple
- keep stats between each phase so we can see our "loss" from the original list of all domains to the ones that are actually included in the final site. we can see how many don't resolve their domains, how many contain unparsable html or unextractable html title tags, how many don't have favicons, and so on.
- still undecided on whether or not we should include sites that do not have favicons
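
The per-phase "loss" stats mentioned above could be reported with something like the following sketch. The function name `funnel_report` and the field names are hypothetical; it assumes each phase simply reports how many hosts survived it, in pipeline order.

```python
from typing import Dict, List, Tuple


def funnel_report(stages: List[Tuple[str, int]]) -> List[Dict[str, object]]:
    """Given (stage name, surviving count) pairs in pipeline order, compute loss per stage."""
    report: List[Dict[str, object]] = []
    if not stages:
        return report
    total = stages[0][1]  # the first stage is the original list of all domains
    previous = total
    for name, count in stages:
        report.append({
            "stage": name,
            "count": count,
            "lost_from_previous": previous - count,
            "retained_of_total": round(count / total, 4) if total else 0.0,
        })
        previous = count
    return report
```

Printing this after each run gives a quick view of which phase drops the most hosts, which is where to look first when hunting bugs.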