Commit graph

72 commits

SHA1 Message Date
ead6366ed0 up ulimit for more connections 2026-05-20 10:18:48 -04:00
6d8ba61102 update warc parsing with new 3 stage producer, worker, consumer model, increasing speed and saturating cores 2026-05-20 10:18:15 -04:00
0efec72e45 print every 100 bundles 2026-05-20 10:17:35 -04:00
426abe1c90 upped concurrency of icon downloading 2026-05-20 09:47:18 -04:00
3bc355e503 improved bundle cli output with progress 2026-05-20 09:46:59 -04:00
86cff37533 download cc-index to home not tmp (which is tmpfs) 2026-05-20 09:35:06 -04:00
9308b5e039 download cc-index first with aws cli instead of streaming it 2026-05-20 08:14:22 -04:00
564919c5cc added downloaded_at timestamp to icon table 2026-05-20 01:35:13 -04:00
ec33b2e857 bump up s3 warc retries to 6 to avoid 503 errors 2026-05-20 01:30:46 -04:00
081866f62e update bundle gen to use channels and goroutines to saturate disk and not block on db access + bundle coalescing and uploading 2026-05-20 01:28:52 -04:00
902928235c updated best icon selection logic 2026-05-20 01:15:08 -04:00
03e343a136 cap number of favicons to 50 per host 2026-05-20 00:53:24 -04:00
cd896427eb shuffle icon link batches before putting them in the channel 2026-05-20 00:50:40 -04:00
27203ff085 updated bot rate 2026-05-20 00:50:17 -04:00
963d9209ca cleaner dns error handling 2026-05-20 00:35:55 -04:00
c9ea462e97 check all CSP headers for iframe disallowing 2026-05-20 00:32:56 -04:00
a8177a1583 improve stats generation 2026-05-20 00:31:38 -04:00
0c9ad5bfd6 count iframes only if there isn't an error 2026-05-20 00:29:28 -04:00
3264288752 capped random favicons for frontend at 100 2026-05-20 00:17:12 -04:00
56ae26cbef added bmp decoder to bundler 2026-05-20 00:11:53 -04:00
7d24b406aa redundant min 2026-05-20 00:10:04 -04:00
eb40995c60 just overwrite bundles, don't delete then re-add 2026-05-20 00:09:53 -04:00
d6ef34a1dc go mod tidy 2026-05-20 00:07:48 -04:00
258c6c5f3a updated ARCHITECTURE.md 2026-05-19 23:46:06 -04:00
2f1547a912 switched bundle host field to url to retain http 2026-05-19 23:38:14 -04:00
7f36e99443 updated random value to double precision float 2026-05-19 23:37:50 -04:00
41c0eb5c49 updated PLAN.md with another 3M run to test code changes 2026-05-19 13:42:19 -04:00
a28cd2b056 updated pipeline README 2026-05-19 13:06:48 -04:00
1b665d1065 it's web, not Web according to APA style guides 2026-05-19 12:25:45 -04:00
4ceefeec3d fixed typo 2026-05-19 11:50:49 -04:00
f7f564289c drop the space, it's cleaner 2026-05-19 11:49:13 -04:00
3534f84b27 added about.html 2026-05-19 11:42:09 -04:00
1d5b7bd374 added random_order to host table schema 2026-05-19 10:47:05 -04:00
e6d5d5175c fixed oom in bundle_gen and added randomOrder, still need a full redesign 2026-05-19 10:46:40 -04:00
cf17fc42b1 fixed icon downloading performance issues 2026-05-19 10:32:34 -04:00
2745e75408 updated infra README about pinning AMI 2026-05-19 10:23:29 -04:00
2f4e5b585d updated PLAN.md for future plans 2026-05-19 08:34:42 -04:00
85b663a6e8 added logging to cloudfront 2026-05-18 13:42:49 -04:00
c7e33defa2 updated plan.md, finished integration test 2026-05-18 12:49:34 -04:00
5b3f6a6870 switched from s3 to disk for saving icons 2026-05-18 12:43:50 -04:00
113a261dae updated duckdb and added a swap file 2026-05-18 02:10:15 -04:00
4436f43c6f force destroy bucket with icons 2026-05-18 01:21:03 -04:00
ddeb8bc504 fix TOTAL_BUNDLES sed command in deploy script 2026-05-18 01:00:09 -04:00
21f2a75ed3 delete old tab bundles before making new ones 2026-05-18 00:49:50 -04:00
a977a8c0b3 added initial pipeline README 2026-05-18 00:40:57 -04:00
921f72d2aa added deploy script 2026-05-18 00:40:27 -04:00
2bdb71a47a added bot.html for scanning 2026-05-18 00:40:20 -04:00
f64b93b229 random favicon selection changing 2026-05-18 00:39:22 -04:00
e5035d9a28 updated PLAN.md, finished phase 5 2026-05-18 00:26:50 -04:00
4963866427 updated scanning useragent 2026-05-18 00:26:13 -04:00