everytab/design.md

this is the repo for a website that I want to make that can display every tab on the internet. to do that we'll need to construct a database that conceptually is a tuple: (domain/ip, html title, favicon), which is everything needed to render the tabs. it works by using common crawl's crawl of the web, parsing out the hosts, individual website titles, and favicon locations (`<link rel="icon">`, or if not found, assuming /favicon.ico) and storing them in a postgres database. then it downloads every single favicon into object storage and records a link to the stored object in the database.
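a minimal sketch of the "use the rel=icon href, else assume /favicon.ico" rule, using only go's standard `net/url` package. the function name `iconURL` is made up for illustration, and it assumes the href has already been extracted from the html by whatever parser we end up using.

```go
package main

import (
	"fmt"
	"net/url"
)

// iconURL resolves the favicon location for a page. href is the value of a
// <link rel="icon" href="..."> tag, or "" when the page declares none, in
// which case we fall back to /favicon.ico on the same scheme and host.
// (iconURL is a hypothetical helper name, not part of any library.)
func iconURL(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	if href == "" {
		href = "/favicon.ico"
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference handles absolute, root-relative, path-relative,
	// and scheme-relative ("//cdn...") hrefs.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	hrefs := []string{"", "/static/icon.png", "//cdn.example.net/i.png"}
	for _, href := range hrefs {
		u, err := iconURL("https://example.com/", href)
		fmt.Println(u, err)
	}
}
```

resolving against the page url also covers the case where the icon lives on a different host than the page, which matters later when we resolve dns for icon urls rather than for hosts.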

I want to do this entirely in aws, us-east-1, with a low running cost, ideally under $100 a month. there are conceptually two stages to the project: the scanning phase and the hosting phase.

  • scanning phase
    • the scanning phase has 3 resources: an xlarge ec2 instance, an RDS postgres database, and an s3 bucket for storing the icons (called, say, "everytab", with the icons kept under an icons/ prefix)
    • this scanning phase has multiple parts:
      • download the cc-index to the ec2 instance
      • use duckdb with the postgres extension to do a one shot query, creating a table called "hosts" that only includes the relevant rows and columns of the cc-index
      • read and parse the WARCs (defensively, accounting for malformed html) for the html "title" information (to update the "hosts" table) and the rel="icon" information (to create a new table called "icons" with a foreign key pointing to the correct "hosts" row; after all, one page can have multiple icons)
      • run a recursive dns resolver, so dns does not go upstream but instead hits a local dns server and we don't annoy upstream resolvers. use caching and a min-ttl to ensure that over the entire scan we reuse resolutions for tlds and even popular subdomains.
      • download all the icons from the "icons" table, saving them into an s3 bucket named "everytab" in a folder named "icons" with a unique id (maybe a hash of the contents, but maybe something else). update the "icons" table with the scan status, and the hash once the icon is found.
      • the last part is to use this data to prep for the hosting phase
  • hosting phase
    • after scanning, the ec2 instance should prepare for the hosting phase. this is where we use the database information and the icon data in the s3 to construct a website that can display all of the icons in a cool way: a website that has "every tab"
    • the actual hosting of the website is "static" - it will just use s3 and cloudfront to reduce costs. this final "preparation" the ec2 instance should do before turning off is to prepare the site in a way to make it easier to serve statically.
    • this will work by first finding the best icon for each site (the largest one, up to 64x64) and storing it in the "hosts" table
    • then it will take each row of the hosts table, randomize the order, and split the rows into chunks of N hosts, where we tune N so that each artifact it makes is ~1MB.
    • each artifact is just one big json with a list of N (host, title, icon) entries, plus a bit more as needed (e.g. content type for the icon), with the icon stored inline as, say, base64 (or a json-safe base85). it stores these jsons in the s3 bucket under tabs/{n}.json
    • the frontend works like this: a single index.html with inline css and javascript. the javascript seeds a random number generator with the current time, generates a random integer between 0 and (total_hosts / N), fetches that bundle of tabs from the tabs s3 folder, and renders them so it looks like the whole page is covered in tabs. when the user clicks on a tab, it opens an iframe and actually visits the page (as it would in a browser if you clicked the tab). it also allows the user to click again to stop showing the iframe, click X to close the tab, or scroll down, where it fetches more random tab bundles as needed (careful not to fetch one it has already seen)
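the shuffle-and-chunk step could look roughly like the sketch below. everything named here (`Tab`, its json field names, `bundleSize`, `chunk`, the 2 KiB average entry size) is an assumption for illustration; the real program would read rows from postgres and icons from s3 instead of building fake data.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"math/rand"
)

// Tab is one entry in a bundle. The field names are illustrative only.
type Tab struct {
	Host     string `json:"host"`
	Title    string `json:"title"`
	IconType string `json:"icon_type"` // content type of the icon
	Icon     string `json:"icon"`      // icon bytes, base64 encoded
}

// bundleSize estimates N from a target artifact size and the measured
// average encoded size of one tab entry. It never returns less than 1.
func bundleSize(targetBytes, avgTabBytes int) int {
	if avgTabBytes <= 0 {
		return 1
	}
	n := targetBytes / avgTabBytes
	if n < 1 {
		return 1
	}
	return n
}

// chunk shuffles tabs in place (seeded, so a run is reproducible) and
// splits them into bundles of at most n entries each.
func chunk(tabs []Tab, n int, seed int64) [][]Tab {
	if n < 1 {
		n = 1
	}
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(len(tabs), func(i, j int) { tabs[i], tabs[j] = tabs[j], tabs[i] })
	var bundles [][]Tab
	for start := 0; start < len(tabs); start += n {
		end := start + n
		if end > len(tabs) {
			end = len(tabs)
		}
		bundles = append(bundles, tabs[start:end])
	}
	return bundles
}

func main() {
	icon := base64.StdEncoding.EncodeToString([]byte("fake icon bytes"))
	var tabs []Tab
	for i := 0; i < 5; i++ {
		tabs = append(tabs, Tab{
			Host:     fmt.Sprintf("site%d.example", i),
			Title:    "Example",
			IconType: "image/png",
			Icon:     icon,
		})
	}
	fmt.Println("N for ~1MB at an assumed 2KiB per tab:", bundleSize(1<<20, 2048))
	for i, b := range chunk(tabs, 2, 1) {
		body, err := json.Marshal(b)
		if err != nil {
			panic(err)
		}
		// The real program would upload body to s3 as tabs/{i}.json.
		fmt.Printf("tabs/%d.json: %d tabs, %d bytes\n", i, len(b), len(body))
	}
}
```

one thing this makes visible: the last bundle is usually short, and the number of bundles is the ceiling of total_hosts / N, so the frontend needs to know the bundle count (for example baked into index.html) rather than recomputing it.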

we want the frontend to be hit with as few requests as possible to keep cloudfront costs low (hence this design). this design needs only 2 requests per initial page visit (index.html plus one tab bundle), but if it's difficult to design the javascript frontend that way we can probably do more requests (we still want to keep the number small)

The scale and heterogeneity of the data make this a challenge. We can piggyback on the work common crawl has done, but only so much. Runs will take multiple hours, so we need to ensure the data pipeline is correct, that the infrastructure can handle the bandwidth and size, and that we capture statistics throughout the run as a sanity check. Statistics about errors or loss between each step of the workflow would be especially useful, and printing some of the failing rows out for independent investigation will help squash bugs.

Testing each part with a "dry run" that just does, say, 10,000 rows and doesn't make any changes (but prints the changes it would make to the command line) seems like a good way to do testing (though if there are better ways, let me know). In fact a good strategy for rollout will be to cap the number of domains we get from the cc-index to 100,000 at the start, so we have a smaller dataset to build and test this workflow and site end to end before scaling up to the full 30 million domains.

Keep in mind that we are fine if we don't capture every single site. Capturing even 90% would be a win, and we can iterate to get it to 99% on the next crawl. (naturally there will be some loss, as common crawl only releases about once a month and sites go up and down between the crawl release and our captures) We want the effect to feel like every site is on there; the randomness of its display means it is hard to check whether literally every site is there (though, once again, we want to support as many as is easy to start off with)
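the per-step loss statistics mentioned above could start as something as simple as an ordered list of counters. this is only a sketch: the `Funnel` type and the stage names are invented here, and the numbers in `main` are made up to show the output shape, not measurements.

```go
package main

import "fmt"

// Funnel records how many rows survive each pipeline stage, in order.
// (Hypothetical helper type for illustration.)
type Funnel struct {
	names  []string
	counts []int
}

// Record appends the surviving row count for the next stage.
func (f *Funnel) Record(stage string, rows int) {
	f.names = append(f.names, stage)
	f.counts = append(f.counts, rows)
}

// Loss returns how many rows were dropped between stage i-1 and stage i,
// and that drop as a fraction of stage i-1. The first stage has no loss.
func (f *Funnel) Loss(i int) (dropped int, frac float64) {
	if i <= 0 || i >= len(f.counts) {
		return 0, 0
	}
	prev := f.counts[i-1]
	dropped = prev - f.counts[i]
	if prev > 0 {
		frac = float64(dropped) / float64(prev)
	}
	return dropped, frac
}

// Report prints one line per stage with the loss relative to the stage before.
func (f *Funnel) Report() {
	for i, name := range f.names {
		dropped, frac := f.Loss(i)
		fmt.Printf("%-18s %8d rows, lost %7d (%.1f%%)\n",
			name, f.counts[i], dropped, frac*100)
	}
}

func main() {
	var f Funnel
	// Made-up counts, only to show what a report would look like.
	f.Record("cc-index hosts", 100000)
	f.Record("title parsed", 96000)
	f.Record("icon url resolved", 93000)
	f.Record("icon downloaded", 88000)
	f.Report()
}
```

a dry run could use the same counters and simply skip the writes, so the report from 10,000 rows has the same shape as the report from a full run.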

on that note, we want to do iterative development. there is certainly a "best way" or even a "better way" to do this, but we are fine with earlier iterations of this workflow being "run this command / shell script" or "create the IAM role in the aws cli" rather than spending our time on state of the art cloud orchestration or database tuning. good enough for a full end-to-end test is our initial goal; then we can move on to improving parts of the pipeline for future crawl downloads.

Ideally we can do a partial test where we only process 1% (or 0.1%) of the data, so we can get a sense of timing, storage, and any issues that might arise from our workflow.
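one way to pick that 1% (or 0.1%) is to hash the hostname and keep a fixed slice of the hash space, so the same hosts are selected at every stage and on every rerun. this is a sketch of that idea, not a decided design; `inSample` is a made-up name and FNV-1a is just a convenient standard library hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// inSample reports whether host falls into a deterministic sample of
// roughly perMille/1000 of all hosts: 10 is about 1%, 1 is about 0.1%.
// The same host always gets the same answer, so every pipeline stage
// can apply the filter independently and still agree.
func inSample(host string, perMille uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(host)) // Write on a hash.Hash never returns an error
	return h.Sum32()%1000 < perMille
}

func main() {
	kept := 0
	const total = 100000
	for i := 0; i < total; i++ {
		if inSample(fmt.Sprintf("site%d.example", i), 10) {
			kept++
		}
	}
	fmt.Printf("kept %d of %d synthetic hosts\n", kept, total)
}
```

the kept fraction is only approximately the requested rate, which is fine for estimating timing and storage. the same filter could probably be pushed into the duckdb query with a hash function there, but that would need checking against what duckdb offers.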

more specification, notes, thoughts, etc:

  • just download the latest cc-index for now, but we would like to tag in the database which crawl each row comes from and keep the workflow crawl agnostic, so we can run the next crawl through the same workflow
  • the input and output of the cc-index query will look something like:
  • input:
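    • WHERE url_path = '/' AND content_mime_type = 'text/html' AND fetch_status = 200 AND url_query IS NULL AND url_protocol IN ('http', 'https') AND url_port IN (80, 443);
    • (to verify against the cc-index schema: url_port may be NULL when the url has no explicit port, in which case this filter would need to be url_port IS NULL OR url_port IN (80, 443))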
    • we just want the root path per domain/ip and only html files that return. so each domain/ip should have at most 2 rows in our db: http and https. even better if we default to https and if it doesn't exist, use http
  • output to the hosts table:
    • postgres columns should be: hostname, ip, crawl, protocol (http or https), warc_filename, warc_record_offset, warc_record_length, warc_segments, favicon_hash, html_title
  • use duckdb with a postgres extension with a one shot query to take the data I want and put it in the db
  • use unbound to resolve all domain names to ips (or some better option for resolving domain names - search for one!). we don't want to forward any requests to an upstream resolver; we want to resolve names ourselves so we don't get throttled
  • we want to use custom high-performance golang code so the warc reading and parsing is fast and saturates all cores / bandwidth
  • we also want to use a custom golang program to package up our found, valid (host, icon, title) jsons, also ideally multithreaded and high performance
  • there is a big question about the heterogeneity of favicon types and how much and what to store. I'd like to allow for multiple favicons per site. we should include a /favicon.ico for all sites (with the same protocol, http/https, as the host) whether they have link rel="icon" set or not
  • maybe to start we should just use gif, png, and ico files. we also might want to convert them into the best format for small filesize when creating the tabs bundles
  • all of this should be defensively programmed:
    • we don't want huge downloads for icons (set a max download size that is sane for icons)
    • we want to be able to handle malformed html, either by failing or by trying to recover and use a partial parse to grab the title and the icon paths
    • we want to deal with network issues when grabbing the favicons: they could be all over the world and take a while, requests could get lost / dropped, etc. we want to balance making a good attempt to get the data we want with the speed of the scan
    • if we can choose which rel=icons to add (say, just png, gif, ico, with the sizes we want: 16x16, 32x32, 48x48, 64x64), do that filtering before adding them to the icons table
  • we want to resolve the domain names in the icons table, not the hosts table (after all, the requests we actually make are to the icons; their hosts are mostly the same as the page hosts but not necessarily!). this will require parsing the icon url correctly. we also want to actually use the resolved ips when making that request; make sure to turn off domain name resolution for the request so we don't anger upstream dns providers.
  • for each host, we want https first, falling back to http. ideally one line per host.
  • we want the frontend to be simple and snappy with good UI/UX. we want it to be beautiful.
  • potentially we can even read the user agent and have different styles to mimic the tab style of the user's browser. that might be a later feature; for now keep it simple
  • keep stats between each phase so we can see our "loss" from the original list of all domains to the ones that are actually included in the final site. we can see how many don't resolve their domains, how many contain unparsable html or unextractable html title tags, how many don't have favicons, and so on.
  • still undecided on whether or not we should include sites that do not have favicons
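two of the defensive points above (connect to the ip we already resolved instead of letting the http client do its own dns lookup, and cap the icon download size) can be sketched with go's standard library alone. the names `pinnedAddr`, `newClient`, `fetchIcon`, the 256 KiB cap, and the timeouts are all assumptions picked for illustration; the ip in `main` is a documentation-only address.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// maxIconBytes is an assumed cap; tune it to whatever is sane for icons.
const maxIconBytes = 256 << 10

// pinnedAddr rewrites "host:port" to "ip:port" using ips we resolved
// ourselves, and fails for hosts we have no resolution for.
func pinnedAddr(addr string, resolved map[string]string) (string, error) {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	ip, ok := resolved[host]
	if !ok {
		return "", errors.New("no pre-resolved ip for " + host)
	}
	return net.JoinHostPort(ip, port), nil
}

// newClient returns a client that only ever connects to pre-resolved ips.
// The url keeps its hostname, so the Host header and TLS verification
// still use the real name; only the tcp dial target is replaced.
func newClient(resolved map[string]string) *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	transport := &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			pinned, err := pinnedAddr(addr, resolved)
			if err != nil {
				return nil, err
			}
			return dialer.DialContext(ctx, network, pinned)
		},
	}
	return &http.Client{Transport: transport, Timeout: 15 * time.Second}
}

// fetchIcon downloads an icon, rejecting non-200 responses and any body
// larger than maxIconBytes (it reads one extra byte to detect overflow).
func fetchIcon(c *http.Client, iconURL string) ([]byte, error) {
	resp, err := c.Get(iconURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxIconBytes+1))
	if err != nil {
		return nil, err
	}
	if len(body) > maxIconBytes {
		return nil, errors.New("icon exceeds size cap")
	}
	return body, nil
}

func main() {
	// 192.0.2.10 is a reserved documentation address, not a real host.
	resolved := map[string]string{"example.com": "192.0.2.10"}
	addr, err := pinnedAddr("example.com:443", resolved)
	fmt.Println(addr, err)
	_ = newClient(resolved) // pass to fetchIcon(client, "https://example.com/favicon.ico")
}
```

a side effect of pinning is that a redirect to a host we never resolved fails at dial time instead of triggering a fresh dns lookup, which is probably what we want for a first pass but would show up as loss in the stats.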